
[client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab (stack 2/4) #6082

Open
MichaelUray wants to merge 64 commits into netbirdio:main from MichaelUray:pr/b-phase3.5-network-change

Conversation

@MichaelUray
Contributor

@MichaelUray MichaelUray commented May 5, 2026

[client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab

Branch: pr/b-phase3.5-network-change → base pr/a-p2p-dynamic-foundation (stacked PR — depends on PR-A landing first)
Compare: https://github.com/netbirdio/netbird/compare/MichaelUray:netbird:pr/a-p2p-dynamic-foundation...MichaelUray:netbird:pr/b-phase3.5-network-change?expand=1


Summary

This PR adds network-change handling, signal-triggered ICE re-attach, and a short ICE-failure retry delay inside a post-network-change grace window (so the exponential backoff isn't burned on the well-known case where the first ICE pair-check after a WiFi/cellular handover fails on stale NAT mappings). It also carries over the Guard-Loop fix from the closed PR #5805 and adds the GUI surface for Connection Mode + timeouts in the client UI.

This is the second of four stacked PRs implementing #5989. It builds on PR-A.

Why

The Phase-1+2+3 foundation (PR-A) handles steady-state operation well, but real devices do not stay in steady state:

  • Laptops roam between WiFi and cellular.
  • Mobile devices wake from Doze/StandBy with stale NAT mappings.
  • LTE modems briefly drop and re-attach.

After such an event the existing ICE agent is bound to a departing local interface and pion's pair-checks fail. The exponential backoff (PR-A) then kicks in and the peer stays on relay for minutes.

Two related fixes are needed:

  1. Re-attach ICE after a network change — recreate the ICE worker on a fresh socket, send a new offer. (Phases 3.5, 3.7d, 3.7e.)
  2. Short retry delay during the post-network-change grace window — the first pair-check often fails on stale mappings even when subsequent ones succeed. Falling back to the normal 1-minute initial backoff after that single failure leaves the peer on relay far longer than the underlying connectivity warrants. A 60-second grace window with 2-second retries lets ~4-5 attempts run while the new mapping primes. (Phases 3.7f, 3.7h.)

In addition, the previously closed PR #5805 ("Fix ICE reconnection loop and buffer candidates before agent initialization") had a known reconnect loop: rebuilding the ICE worker on every signal-state change queued new offers faster than pion could complete pair-checks. The clean fix is to skip new offers while the ICE agent is still in Connecting.

What's in this PR

Phase 3.5 — Reset on network change

  • client/peer/conn: reset ICE backoff + recreate workerICE when the signal/relay layer reconnects after a network change.
  • client/peer/conn: refactor onNetworkChange to use the in-place agent recreate path; drop SendOffer from onNetworkChange to fix offer-storm; revert and replace an interim worker_ice candidate-buffering attempt with a clean approach.

Phase 3.7d — Signal-trigger re-attach

  • Re-attach ICE on every signal trigger so a peer that lost ICE while the daemon was Disconnected from signaling resumes correctly when signaling is back.

Phase 3.7e — Listener inside onNetworkChange

  • Re-attach the ICE listener inside onNetworkChange itself so a network change fires a fresh listener registration without waiting for the next activity event.

Phase 3.7f + 3.7h — Grace window

  • iceBackoff.markFailure: while inside networkChangeGracePeriod (60s) of the most recent Reset(), cap the suspend delay at networkChangeRetryDelay (2s) and do NOT advance the long-term exponential schedule. Outside the window the normal 1-minute initial backoff applies.
  • Window widened from initial 30s to 60s and delay reduced from 5s to 2s after observing real-world LTE-bounce behaviour: cold NAT mappings often need 3-4 ICE attempts to prime, and the previous 30s window only fit ~2 attempts.
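
The following is a minimal sketch of that grace-window cap, assuming an illustrative backoff shape; networkChangeGracePeriod and networkChangeRetryDelay are the names used above, while the type and the remaining identifiers are placeholders rather than the PR's exact code:

package peer

import "time"

const (
	networkChangeGracePeriod = 60 * time.Second // grace window after Reset()
	networkChangeRetryDelay  = 2 * time.Second  // short retry delay inside the window
	initialIceBackoff        = 1 * time.Minute  // first step of the normal exponential schedule
)

// iceBackoffSketch is an illustrative stand-in for the per-peer ICE backoff.
type iceBackoffSketch struct {
	lastReset   time.Time     // set by Reset() on a network change / signal reconnect
	nextBackoff time.Duration // next step of the long-term exponential schedule
}

// Reset records a network change so the failures that follow get cheap retries.
func (b *iceBackoffSketch) Reset() {
	b.lastReset = time.Now()
	b.nextBackoff = initialIceBackoff
}

// markFailure returns how long to suspend ICE retries after a failure.
// Inside the grace window the delay is capped at 2s and the exponential
// schedule is not advanced; outside the window the normal schedule applies.
func (b *iceBackoffSketch) markFailure() time.Duration {
	if time.Since(b.lastReset) < networkChangeGracePeriod {
		return networkChangeRetryDelay
	}
	delay := b.nextBackoff
	b.nextBackoff *= 2
	return delay
}

Each attempt then costs the 2-second delay plus the ICE pair-check itself, which is how roughly 4-5 attempts fit inside the 60-second window before the normal 1-minute schedule takes over again.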

Phase 3.7g — Skip ICE-Close on NC when still connected

  • If ICE is still in Connected state when the network-change fires, don't call workerICE.Close() — the existing pair is fine, just refresh the listener.
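
An illustrative sketch of that branch; WorkerICE.IsConnected is listed in the change summary, but the surrounding struct and helper hooks below are hypothetical stand-ins for the real re-attach path:

// workerICE is the minimal surface this sketch needs from the real ICE worker.
type workerICE interface {
	IsConnected() bool
}

// connSketch stands in for the real peer.Conn; the fields and helper funcs
// are placeholders, not the PR's actual identifiers.
type connSketch struct {
	workerICE       workerICE
	resetBackoff    func() // e.g. iceBackoff.Reset(): engage the grace window
	refreshListener func() // re-register the ICE listener only
	recreateWorker  func() // rebuild the ICE worker on a fresh socket, send a new offer
}

// onNetworkChange shows the Phase 3.5 / 3.7g behaviour: skip the teardown
// when the existing ICE pair survived the network change.
func (c *connSketch) onNetworkChange() {
	if c.workerICE != nil && c.workerICE.IsConnected() {
		c.refreshListener()
		return
	}
	c.resetBackoff()
	c.recreateWorker()
}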

#5805 cleanup — Guard-Loop fix

  • worker_ice: skip new offers while the ICE agent is in Connecting. Drops the candidate-buffer approach the closed PR proposed in favour of a much simpler offer-skip during connecting.
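
A self-contained sketch of that guard, using a local state enum in place of the real worker's fields (all identifiers here are illustrative):

// iceWorkerState mirrors the worker's view of the agent, updated from
// pion's connection-state callback in the real code.
type iceWorkerState int32

const (
	iceStateIdle iceWorkerState = iota
	iceStateConnecting
	iceStateConnected
	iceStateFailed
)

type workerICESketch struct {
	lastKnownState iceWorkerState
}

// maybeSendOffer drops a new offer while pair checks are still in flight
// instead of queueing it; stacking offers during Connecting is what caused
// the reconnect loop in the closed PR #5805.
func (w *workerICESketch) maybeSendOffer(send func() error) error {
	if w.lastKnownState == iceStateConnecting {
		return nil // skip: let the in-flight pair checks finish first
	}
	return send()
}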

GUI surface (Phase 3.7h GUI)

  • client/cmd/service: persist connection-mode + timeouts on install/reconfigure.
  • client/ui: Connection Mode + timeouts in Network tab.
  • client: surface server-pushed connection-mode/timeouts via daemon-RPC.
  • client/ui: "Follow-Server (currently: …)" display + remove the now-redundant Lazy menu.
  • client/android: gomobile getters for ConnectionMode + ServerPushed values.

Tests

  • go test ./client/internal/peer ./client/internal/peer/guard — pass (full ICE-backoff suite including the new grace-window cases).
  • go build ./client/... — pass on linux/amd64, linux/arm64, windows/amd64.
  • Hardware-validated against three OpenWrt routers (LAN host candidate, internet srflx, LTE srflx) plus a Windows 11 daemon. After a forced LTE-modem reset, the LTE peer recovers to P2P srflx within ~50 s instead of the 2-3 minutes the pre-3.7f code took.

Test plan

  • Grace window: failure inside 60s → 2s retry; failure after 60s → exponential restarts at ~1m.
  • No regression for the non-network-change failure path (still uses the exponential schedule).
  • Signal/relay reconnect drives Reset() → grace window engages.
  • Maintainer to verify: GUI changes on macOS / Linux ARM (I tested Windows + Linux/amd64 only).

Use case

A laptop that switches between docked Ethernet, office WiFi, and tethered LTE during the day. Without this PR every transition burns the exponential backoff and the user sees "stuck on relay" for 5-15 minutes. With this PR the post-handover recovery is bounded at ~50s.

Linked work

Maintainers are welcome to push directly to this branch.

Summary by CodeRabbit

  • New Features

    • Configurable Connection Modes (relay-forced, p2p, p2p-lazy, p2p-dynamic) with per-mode timeouts, local overrides, and server-pushed defaults surfaced to clients.
    • ICE failure backoff with visible status (suspended, failures, next-retry) and per-peer retry cap; settings persist via profile/install.
  • UI

    • Network/Settings UI updated to manage connection mode, timeout fields, and show server-provided hints.
  • Tests

    • Expanded coverage for mode resolution, inactivity timers, ICE backoff, and related flows.

Documentation

  • Documentation is not needed

These changes are internal lifecycle / behavioural improvements; no user-visible API or CLI flag added that warrants new public docs. Existing flags/Settings already documented at netbirdio/docs cover the surface area.

@coderabbitai
Contributor

coderabbitai Bot commented May 5, 2026

Caution

Review failed

Failed to post review comments

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a connection-mode enum and precedence (env→profile→server), per-mode timeouts and server-pushed overrides, per-peer ICE backoff and propagation, two-timer inactivity (ICE vs Relay), and end‑to‑end wiring across proto, management, daemon (Engine/ConnMgr), peer ICE lifecycle, CLI, Android, UI, status, and tests.
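
For orientation, a hedged sketch of what that env → profile → server precedence amounts to; the function, parameters, and fallback value below are illustrative rather than the PR's actual API:

// resolveMode picks the effective connection mode: an environment override
// wins, then the local profile/config value, then the server-pushed default.
// An empty string means "not set" at that layer.
func resolveMode(envMode, profileMode, serverMode string) string {
	for _, m := range []string{envMode, profileMode, serverMode} {
		if m != "" {
			return m
		}
	}
	return "p2p" // illustrative fallback when nothing is configured anywhere
}

The same layering applies to the per-mode timeouts, with the server-pushed values sitting at the lowest-priority layer.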

Changes

Connection Mode & Related Wiring (single cohort)

  • Data Shape / Proto / Types (shared/connectionmode/mode.go, shared/management/proto/management.proto, client/proto/daemon.proto, shared/management/http/api/openapi.yml, shared/management/http/api/types.gen.go): Introduces connectionmode.Mode enum and new fields for ConnectionMode, P2pTimeoutSeconds, RelayTimeoutSeconds, P2pRetryMaxSeconds and server_pushed_* variants; OpenAPI/types updated to replace legacy lazy flag.
  • Management types & audit (management/server/types/settings.go, management/server/account.go, management/server/activity/codes.go): Account Settings gain nullable ConnectionMode and timeout fields; deep-copy helpers added; audit handlers and activity codes/events added for setting changes.
  • Management gRPC conversion (management/internals/shared/grpc/conversion.go, management/internals/shared/grpc/conversion_test.go): toPeerConfig resolves effective ConnectionMode (legacy lazy bool + explicit setting), emits per-peer ConnectionMode and timeout fields (with disabled sentinel), and tests validate mapping.
  • Daemon config / Engine wiring (client/internal/connect.go, client/internal/engine.go, client/internal/profilemanager/config.go): EngineConfig and profile config accept ConnectionMode and timeout overrides; parse helper added; Engine exposes ConnMgr(); profile persistence fields and apply logic added.
  • ConnMgr: resolution, server-pushed, getters (client/internal/conn_mgr.go, client/internal/conn_mgr_test.go): ConnMgr stores env/cfg/server-pushed values, resolves effective mode/timeouts (env→cfg→server), applies server pushes (UpdatedRemotePeerConfig), propagates P2P retry-max to connections, and exposes getters; tests added.
  • Peer / ICE lifecycle & backoff (client/internal/peer/conn.go, client/internal/peer/ice_backoff.go, client/internal/peer/worker_ice.go, client/internal/peer/status.go, client/internal/peer/*_test.go): ConnConfig adds Mode and P2pRetryMaxSeconds; per-peer ICE backoff implemented (BackoffSnapshot, DefaultP2PRetryMax), Conn methods AttachICE/DetachICE/SetIceBackoffMax/IceBackoffSnapshot and handlers onICEFailed/onICEConnected added; WorkerICE reports failures/success and IsConnected; status tracks and exposes ICE backoff; tests added.
  • Handshaker & guard callbacks (client/internal/peer/handshaker.go, client/internal/peer/guard/guard.go, client/internal/peer/handshaker_test.go): Handshaker gains AddICEListener/RemoveICEListener/readICEListener and dispatches offers/answers via listener; Guard can register SetOnNetworkChange callback invoked on network events; tests added.
  • Lazy-conn inactivity (two-timer) (client/internal/lazyconn/manager/manager.go, client/internal/lazyconn/inactivity/manager.go, client/internal/lazyconn/inactivity/manager_test.go): Adds separate ICE and Relay inactivity thresholds, ICEInactiveChan/RelayInactiveChan, non-blocking notify, dual-timer selection logic with Phase-1 compatibility aliasing, and comprehensive tests.
  • Daemon behavior & propagation (client/internal/conn_mgr.go, client/internal/engine.go): ConnMgr changes propagate P2P retry-max to live connections; engine calls UpdatedRemotePeerConfig with network map peer config; createPeerConn populates Mode and P2pRetryMaxSeconds from ConnMgr.
  • CLI / Service installer / Profile persistence (client/cmd/root.go, client/cmd/up.go, client/cmd/service.go, client/cmd/service_installer.go, client/internal/profilemanager/config.go): Adds persistent CLI flags for connection mode and timeouts (up/install/reconfigure), wires flags into setup/login/SetConfig flows, and persists changed flags to profile via profilemanager.ConfigInput.
  • UI / Android / Status / Debug (client/ui/client_ui.go, client/ui/event_handler.go, client/android/client.go, client/android/preferences.go, client/status/status.go, client/status/status_test.go, client/internal/debug/debug.go): UI adds Connection Mode selector, timeout entries, server-pushed hints and refresh logic; Android exposes ServerPushed getters and Preferences supports local overrides; status and debug include ICE backoff and config fields; tests updated.
  • Server GetConfig (client/server/server.go): GetConfigResponse augmented to include server-pushed ConnectionMode and timeout fields read from Engine.ConnMgr when available.
  • Tests / Coverage (multiple *_test.go files across packages): Extensive unit tests added for mode resolution, ConnMgr behavior and race-safety, two-timer inactivity, ICE backoff behavior, handshaker listeners, peer conn lifecycle, management conversion, API handlers, and settings copy semantics.

Sequence Diagram

sequenceDiagram
    participant UI as UI
    participant Daemon as Client Daemon
    participant Engine as Engine/ConnMgr
    participant Peer as Peer Conn
    participant ICE as ICE Agent
    participant Mgmt as Management Server

    UI->>Daemon: set ConnectionMode / timeouts (flags/settings)
    Daemon->>Engine: apply config / profile
    Mgmt->>Daemon: GetConfig (includes server-pushed values)
    Daemon->>Engine: receive GetConfig -> Engine.ConnMgr updated
    Engine->>Daemon: expose ServerPushed* via ConnMgr
    Engine->>Peer: update effective Mode / timeouts
    Peer->>ICE: AttachICE() (mode-dependent)
    ICE->>Peer: OnStateChange(Connected/Failed)
    alt ICE Failed
        Peer->>Peer: onICEFailed() -> mark backoff, DetachICE
        Peer->>Engine: update status (BackoffSnapshot)
        Engine->>UI: status includes iceBackoff fields
    else ICE Connected
        Peer->>Peer: onICEConnected() -> reset backoff
        Engine->>UI: status cleared/synced
    end
    Mgmt->>Engine: server-pushed update arrives
    Engine->>Peer: propagate new P2P retry max / re-arm listeners

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • netbirdio/netbird#5698: touches the same management conversion path (toPeerConfig) and related wire-format fields.
  • netbirdio/netbird#5732: modifies service installer/profile persistence similar to the new flag persistence here.
  • netbirdio/netbird#5820: overlapping changes around ICE/connection control paths and engine/handshaker behavior.

Suggested reviewers

  • mlsmaycon
  • lixmal

Poem

🐰 I hopped through modes both lazy and spry,
Server whispers shaped timeouts up high.
Backoffs counted, listeners set with care,
ICE and Relay now idle in pair.
A carrot of config, synced from core to UI.


@MichaelUray
Contributor Author

Stack: #6081 → #6082 (this PR) → #6083 → #6084. Depends on #6081 landing first.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
client/internal/peer/guard/guard.go (1)

138-147: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous callback may stall the reconnect loop

onNetworkChange() is called directly inside the select body. If the callback takes any non-trivial time — or tries to acquire a lock that the surrounding Conn code also holds during network-change processing — ctx.Done(), relayedConnDisconnected, and iCEConnDisconnected events are all frozen until it returns.

Per the PR description the callback is expected to "reset iceBackoff + recreate workerICE", which touches Conn's internal mutex and may spawn goroutines. Consider dispatching asynchronously to keep the loop responsive:

💡 Suggested fix
-		case <-srReconnectedChan:
-			g.log.Debugf("has network changes, reset reconnection ticker")
-			ticker.Stop()
-			ticker = g.newReconnectTicker(ctx)
-			tickerChannel = ticker.C
-			iceState.reset()
-			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-			if g.onNetworkChange != nil {
-				g.onNetworkChange()
-			}
+		case <-srReconnectedChan:
+			g.log.Debugf("has network changes, reset reconnection ticker")
+			ticker.Stop()
+			ticker = g.newReconnectTicker(ctx)
+			tickerChannel = ticker.C
+			iceState.reset()
+			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+			// Fire in a goroutine so the reconnect loop stays responsive to ctx.Done()
+			// and other disconnect events while the callback runs.
+			if g.onNetworkChange != nil {
+				go g.onNetworkChange()
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 138 - 147, The select case
calling g.onNetworkChange() synchronously can block the reconnect loop; change
it to invoke the callback asynchronously (e.g., spawn a goroutine) so the select
can continue handling ctx.Done(), relayedConnDisconnected and
iCEConnDisconnected immediately. Locate the case handling srReconnectedChan in
guard.go (references: srReconnectedChan, ticker, g.newReconnectTicker,
iceState.reset, g.onNetworkChange) and replace the direct call to
g.onNetworkChange() with a non-blocking dispatch (start a goroutine, optionally
recover/panic-handle and/or swap to a bounded worker or channel if you must
limit concurrency) to avoid holding up the loop.
🧹 Nitpick comments (5)
management/server/types/settings_test.go (1)

5-20: ⚡ Quick win

Consider adding a ConnectionMode deep-copy test to cover cloneStringPtr

The test validates cloneUint32Ptr via P2pRetryMaxSeconds, but cloneStringPtr (used by ConnectionMode) is a separate code path with no test coverage. A mirror test would be low-effort:

🧪 Suggested addition
+func TestSettings_Copy_ConnectionMode(t *testing.T) {
+	mode := "p2p-dynamic"
+	src := &Settings{ConnectionMode: &mode}
+	dst := src.Copy()
+	if dst.ConnectionMode == nil {
+		t.Fatal("Copy lost ConnectionMode pointer")
+	}
+	if *dst.ConnectionMode != "p2p-dynamic" {
+		t.Fatalf("expected p2p-dynamic, got %s", *dst.ConnectionMode)
+	}
+	// Verify deep copy (different pointers).
+	*dst.ConnectionMode = "relay-forced"
+	if *src.ConnectionMode != "p2p-dynamic" {
+		t.Fatal("Copy did not deep-clone ConnectionMode")
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/types/settings_test.go` around lines 5 - 20, Add a new unit
test mirroring TestSettings_Copy_P2pRetryMaxSeconds to validate deep-copy
behavior for Settings.ConnectionMode: create a source Settings with
ConnectionMode set to a string pointer, call Settings.Copy(), assert
dst.ConnectionMode is non-nil and equals the original value, then mutate
*dst.ConnectionMode and assert *src.ConnectionMode remains unchanged to verify
cloneStringPtr produced a deep copy; reference the Settings struct, its Copy
method, and cloneStringPtr in the test.
shared/management/http/api/openapi.yml (1)

357-369: ⚡ Quick win

Mark legacy lazy_connection_enabled as deprecated and state precedence.

With both fields present, migration intent is unclear unless deprecation and precedence are explicit in the schema.

🧭 Suggested compatibility clarification
         lazy_connection_enabled:
           x-experimental: true
+          deprecated: true
           description: Enables or disables experimental lazy connection
           type: boolean
           example: true
         connection_mode:
           x-experimental: true
           type: string
           enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
+            When non-NULL, `connection_mode` takes precedence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 357 - 369, Mark the
legacy boolean field lazy_connection_enabled as deprecated and make precedence
explicit: add deprecated: true to the lazy_connection_enabled schema and update
its description to indicate it is legacy and overridden by connection_mode when
connection_mode is non-null; update the connection_mode description to
explicitly state that a non-null connection_mode takes precedence over
lazy_connection_enabled and that NULL means "fall back to
lazy_connection_enabled" for backwards compatibility (use the existing keys
lazy_connection_enabled and connection_mode to locate the schema).
shared/connectionmode/mode_test.go (1)

77-94: 💤 Low value

ModeFollowServer missing from TestToLazyConnectionEnabled.

The table omits ModeFollowServer, so its ToLazyConnectionEnabled() return value is never asserted. If its behaviour changes it won't be caught. Consider adding a case (expected value is almost certainly false, matching ModeUnspecified).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/connectionmode/mode_test.go` around lines 77 - 94, The test
TestToLazyConnectionEnabled omits ModeFollowServer so its
ToLazyConnectionEnabled() behavior isn't asserted; add a test case entry in the
cases table for ModeFollowServer with the expected boolean (likely false,
matching ModeUnspecified) so the call to Mode.ToLazyConnectionEnabled() for
ModeFollowServer is validated like the other entries.
client/internal/peer/status.go (1)

367-382: 💤 Low value

UpdatePeerIceBackoff silently ignores unknown peer keys — inconsistent with other update methods.

Every other UpdatePeer* method (e.g., UpdatePeerState, UpdateWireGuardPeerState, UpdatePeerFQDN) returns an error when the peer is not found. UpdatePeerIceBackoff returns nothing and silently no-ops. While a silent no-op is safe here (the peer may have been removed between the ICE failure and this call), the divergence from the established pattern makes it harder to detect misconfigured call sites during development. Consider returning an error for consistency, or add a brief comment explaining why the silent no-op is intentional.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/status.go` around lines 367 - 382, UpdatePeerIceBackoff
currently silently no-ops when the peer key is missing, diverging from other
update methods like UpdatePeerState / UpdateWireGuardPeerState / UpdatePeerFQDN
which return an error; change UpdatePeerIceBackoff(pubKey string, snap
BackoffSnapshot) to return an error, lock/unlock as before, and when peer not
found return a descriptive error (e.g., fmt.Errorf("peer %s not found",
pubKey)); update callers to handle the returned error, or if you intentionally
want the silent no-op instead, add a short comment above UpdatePeerIceBackoff
explaining why ignoring missing peers is acceptable to avoid surprise.
client/internal/peer/conn_test.go (1)

374-384: 💤 Low value

Test assertion is tied to an internal error string.

Line 381 checks strings.Contains(errMsg, "workerICE") — if the error message in AttachICE is ever reworded, this assertion breaks even though the behaviour is unchanged. Consider asserting on a sentinel error value (errors.Is) instead of a substring, or at minimum add a comment acknowledging the coupling.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn_test.go` around lines 374 - 384, The test currently
asserts on the error string from c.AttachICE() by checking for "workerICE",
which is brittle; change the implementation of AttachICE to return or wrap a
sentinel error (e.g., ErrNilWorkerICE) and update the test to use errors.Is(err,
ErrNilWorkerICE) instead of substring matching; if adding a sentinel isn't
possible right now, at minimum replace the substring check with errors.Is
against an existing exported sentinel or add a clear comment documenting the
coupling. Ensure AttachICE (and any callers) wrap the sentinel with
fmt.Errorf("%w: ...", ErrNilWorkerICE) so errors.Is will work reliably.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/android/preferences.go`:
- Around line 347-352: Validate negative timeout inputs before casting to
uint32: in SetRelayTimeoutSeconds (and the other two timeout setters around the
same area) check if the incoming secs < 0 and, if so, clear the override (set
the corresponding p.configInput.*TimeoutSeconds = nil) instead of casting;
otherwise cast to uint32 and assign as before. Ensure you update all three
setters (the SetRelayTimeoutSeconds function and the two other timeout setter
functions referenced) with the same guard.

In `@client/cmd/up.go`:
- Around line 442-453: The SetConfig handler in server.go is ignoring the
incoming optional fields (req.ConnectionMode, req.RelayTimeoutSeconds,
req.P2PTimeoutSeconds, req.P2PRetryMaxSeconds) so daemon-mode client overrides
are dropped; add four nil-checks in server.go's SetConfig to apply each field
when non-nil (e.g., if req.ConnectionMode != nil { serverConfig.ConnectionMode =
*req.ConnectionMode }) and persist/update the server configuration/store
accordingly; do the same for Login's handler to read and persist the same four
fields from the incoming request (setupSetConfigReq / setupLoginRequest populate
them) so the client flags are honored.

In `@client/internal/conn_mgr.go`:
- Around line 277-320: The code updates ConnMgr.relayTimeoutSecs, p2pTimeoutSecs
and p2pRetryMaxSecs before deciding whether to restart the existing lazy
manager, so changes to timeouts inside the same managed mode are ignored; fix by
capturing whether the timeout values actually changed (e.g. timeoutsChanged :=
newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs || newP2pRetry !=
e.p2pRetryMaxSecs) before overwriting the fields, then after setting e.mode and
the timeout fields, if isManaged && e.lazyConnMgr != nil && timeoutsChanged call
e.closeManager(ctx), e.statusRecorder.UpdateLazyConnection(false), and then
re-init via e.initLazyManager(ctx) followed by e.startModeSideEffects() and
e.addPeersToLazyConnManager(); keep existing logic for mode flips using
modeUsesLazyMgr, and ensure propagateP2pRetryMaxToConns() is still called after
updating the retry field.
- Around line 176-188: When Rosenpass is enabled we must not leave e.mode set to
a lazy/dynamic variant while lazyConnMgr is not started; change ConnMgr.Start so
that if e.rosenpassEnabled is true you replace/downgrade e.mode to a non-lazy
equivalent (e.g., convert p2p-lazy/p2p-dynamic -> their non-lazy counterpart)
before returning, and also update UpdatedRemotePeerConfig to detect when newMode
is a lazy/dynamic variant while e.rosenpassEnabled is true and store the
downgraded non-lazy mode instead of newMode; reference ConnMgr.Start,
e.rosenpassEnabled, e.mode, lazyConnMgr, UpdatedRemotePeerConfig, newMode and
ActivatePeer when making this change.

In `@client/internal/lazyconn/inactivity/manager.go`:
- Around line 233-239: The scan can enqueue the same peer into both iceIdle and
relayIdle, causing races in the consumer; modify the manager scan so that when a
peer meets relayTimeout it is added to relayIdle and then skipped for ICE
notification (i.e., check relayIdle before adding to iceIdle), or alternatively
collapse to a single ordered notification list processed in priority order;
specifically update the block that uses m.relayTimeout and m.iceTimeout
(variables shown as relayIdle, iceIdle, checkTime, since) so relayIdle is
evaluated/assigned first and iceIdle insertion is gated on relayIdle not
containing peerID (or replace both maps with a single ordered notification for
the consumer).
- Around line 76-87: NewManagerWithTwoTimers currently accepts negative or
sub-minute durations; validate and clamp iceTimeout and relayTimeout the same as
the legacy constructor by enforcing MinimumInactivityThreshold (and preventing
negatives) before calling newManager. In NewManagerWithTwoTimers, if iceTimeout
or relayTimeout is less than MinimumInactivityThreshold (or negative), set them
to MinimumInactivityThreshold, log the adjusted values for visibility (e.g.,
"adjusted ICE/relay inactivity timeout"), and then pass the sanitized timeouts
into newManager so both constructors behave identically.

In `@client/internal/lazyconn/manager/manager.go`:
- Around line 103-109: The code treats iceTO==0 && relayTO==0 as a legacy "not
migrated" case; change resolvedTimeouts() to also return a migration flag (e.g.,
resolvedTimeouts() -> (iceTO, relayTO, migrated bool)) and then update the
branch in manager.go to: if !migrated { keep legacy single-timer behavior via
inactivity.NewManager(wgIface, config.InactivityThreshold) } else if iceTO == 0
&& relayTO == 0 { explicitly disable inactivity (e.g., set m.inactivityManager
to a disabled/no-op manager or nil) } else { use
inactivity.NewManagerWithTwoTimers(wgIface, iceTO, relayTO) } so intentional 0/0
does not fall back to the legacy manager; refer to resolvedTimeouts, iceTO,
relayTO, m.inactivityManager, NewManager and NewManagerWithTwoTimers when making
the change.

In `@client/internal/peer/conn.go`:
- Around line 202-212: The code currently conflates three states (disabled
sentinel, explicit 0 for "use default", and a literal non-zero seconds) so
Open() loses a prior "disabled backoff" sentinel; update the Open()/backoff init
logic and SetIceBackoffMax handling to preserve the explicit disable sentinel:
when reading conn.config.P2pRetryMaxSeconds in the backoff initialization
(backoffCap calculation and calls to newIceBackoff/SetMaxBackoff) first test for
the special disable sentinel value (e.g. ^uint32(0) or whatever sentinel you
use) and propagate that as a disabled state to conn.iceBackoff (e.g. leave
conn.iceBackoff as disabled or call a Disable/SetMaxBackoffDisabled path)
instead of treating it as a huge duration; likewise, make SetIceBackoffMax
distinguish between setting the disable sentinel, explicit 0 (meaning use
default cap), and non-zero seconds so it does not rewrite a disable sentinel
into 0 and thereby lose the disabled state across Close/Open cycles.

In `@client/internal/peer/env_test.go`:
- Line 25: Update the test case name string literal "connection_mode unparseable
falls through to legacy" to use the preferred spelling "unparsable" (i.e. change
"unparseable" → "unparsable") in the table entry currently containing
{"connection_mode unparseable falls through to legacy", "garbage", "true", "",
"", connectionmode.ModeRelayForced, 0} so that the test identifier matches
codespell expectations.

In `@client/internal/peer/status.go`:
- Around line 1372-1374: The IceBackoffNextRetry field is being set with
timestamppb.New even when peerState.IceBackoffNextRetry is the zero time, which
produces an invalid past proto timestamp; change the assignment that sets
IceBackoffNextRetry (the struct population using peerState) to guard with
peerState.IceBackoffNextRetry.IsZero() and only call timestamppb.New(...) when
non-zero, otherwise set the proto field to nil so consumers see an unset
timestamp; locate the code that sets IceBackoffNextRetry alongside
IceBackoffFailures/IceBackoffSuspended in status.go and apply this conditional.

In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: The read/write of lastKnownState races because
IsConnected() reads it under muxAgent while onConnectionStateChange writes it
without locking; change lastKnownState to an atomic.Int32 (or int32 accessed via
sync/atomic) and replace writes in onConnectionStateChange with
atomic.StoreInt32(&w.lastKnownState, int32(ice.ConnectionStateX)) and the read
in IsConnected() with atomic.LoadInt32(&w.lastKnownState) (while still using
muxAgent to check w.agent != nil), or alternatively guard the writes in
onConnectionStateChange with muxAgent; update all places that set/read
lastKnownState to use the chosen atomic or lock-based approach to eliminate the
data race.

In `@client/server/server.go`:
- Around line 1518-1538: The comment claiming "all zero/empty when the engine
has not received PeerConfig" is incorrect because
ConnMgr.ServerPushedP2pRetryMaxSecs() substitutes a built-in default, so
spP2pRetMax will be non‑zero; fix this by either updating the comment to reflect
the actual semantics or (preferred) changing the getter behavior: modify
ConnMgr.ServerPushedP2pRetryMaxSecs() (in conn_mgr.go) to return 0 when the
stored peer-config value is unset instead of substituting the default, and keep
default application logic elsewhere (UI or a helper), so the code in server.go
that reads spP2pRetMax from cm.ServerPushedP2pRetryMaxSecs() can reliably use 0
as the "not yet received" sentinel.

In `@client/ui/client_ui.go`:
- Around line 656-666: parseUint32Field currently coerces empty, non-numeric,
and overflowed input to 0 which lets buildSetConfigRequest silently clear
overrides; change parseUint32Field to return (uint32, error) (or (uint32, bool)
if you prefer) and treat empty input as a distinct "no value" case rather than
returning 0, return a parse/overflow error for invalid inputs, and then update
buildSetConfigRequest (and the other caller around lines 741-750) to
propagate/handle the error: reject the request or show validation feedback
instead of sending 0 to the daemon. Ensure you validate numeric range (32-bit)
and non-digit characters in parseUint32Field and adjust all call sites to handle
the new return signature.

In `@management/internals/shared/grpc/conversion.go`:
- Around line 115-119: When parsing Settings.ConnectionMode inside the if
settings.ConnectionMode != nil block use
connectionmode.ParseString(*settings.ConnectionMode) as before but if it returns
an error emit a one-line warning indicating the bad value so admin typos are
visible; update the code around connectionmode.ParseString to log a warn (e.g.,
logger.Warnf or the package logger) with the invalid string and the parse error
when err != nil, otherwise keep the existing behavior of assigning resolvedMode
when m != connectionmode.ModeUnspecified.
- Around line 26-29: The initial peer connection path doesn't translate the
wire-format sentinel (p2pRetryMaxDisabledSentinel) to zero, so Conn.Open()
computes backoffCap from conn.config.P2pRetryMaxSeconds and treats the sentinel
as a huge duration; update Conn.Open() to detect if
conn.config.P2pRetryMaxSeconds equals p2pRetryMaxDisabledSentinel and treat it
as 0 (or alternatively perform the same sentinel→0 conversion earlier in
resolveConnectionMode() / engine.createPeerConnConfig() /
ConnMgr.P2pRetryMax()), then compute backoffCap :=
time.Duration(uint32ValueConvertedToZeroIfSentinel) * time.Second so sentinel
yields duration 0.

In `@management/server/account.go`:
- Around line 372-375: The change handlers currently only audit connection-mode
related updates but do not push new runtime config to live peers; update the
caller to trigger am.updateAccountPeers when any of ConnectionMode,
RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds change.
Concretely: modify handleConnectionModeSettings (or have the handlers return a
bool) so the outer code that calls am.handleConnectionModeSettings(...) can
detect a change and then call am.updateAccountPeers(ctx, userID, accountID) (or
the existing updateAccountPeers signature) after the settings are saved/audited;
ensure the check compares oldSettings.ConnectionMode,
oldSettings.RelayTimeoutSeconds, oldSettings.P2pTimeoutSeconds, and
oldSettings.P2pRetryMaxSeconds against the newSettings versions and only calls
updateAccountPeers when any differ.

In `@management/server/http/handlers/accounts/accounts_handler.go`:
- Around line 229-240: The three timeout fields (req.Settings.P2pTimeoutSeconds,
P2pRetryMaxSeconds, RelayTimeoutSeconds) are being cast directly to uint32 which
allows negative JSON numbers to wrap to large unsigned values; update the
validation before casting in the handler that populates returnSettings: check
each pointer is non-nil and *value >= 0, and if a negative value is detected
return a validation/error response (or handle it per existing validation flow)
instead of casting; then safely convert to uint32 (e.g., v :=
uint32(*req.Settings.P2pTimeoutSeconds)) and assign to
returnSettings.P2pTimeoutSeconds, P2pRetryMaxSeconds, and RelayTimeoutSeconds
respectively.

In `@shared/management/http/api/openapi.yml`:
- Around line 367-372: The description for the "connection_mode" enum is
outdated and still states that "p2p-dynamic" is reserved/passes through; update
the OpenAPI description for the connection_mode field so it reflects the current
phase behavior (remove the claim that p2p-dynamic is reserved/pass-through) and
clearly document which modes are currently functional (relay-forced, p2p,
p2p-lazy) and how NULL interacts with lazy_connection_enabled for backwards
compatibility; edit the description text near the connection_mode property in
shared/management/http/api/openapi.yml and mention the accurate phase semantics
for p2p-dynamic (omit or mark as not supported rather than pass-through) and
ensure references to lazy_connection_enabled and Phase 1/Phase 2 are consistent.

---

Outside diff comments:
In `@client/internal/peer/guard/guard.go`:
- Around line 138-147: The select case calling g.onNetworkChange() synchronously
can block the reconnect loop; change it to invoke the callback asynchronously
(e.g., spawn a goroutine) so the select can continue handling ctx.Done(),
relayedConnDisconnected and iCEConnDisconnected immediately. Locate the case
handling srReconnectedChan in guard.go (references: srReconnectedChan, ticker,
g.newReconnectTicker, iceState.reset, g.onNetworkChange) and replace the direct
call to g.onNetworkChange() with a non-blocking dispatch (start a goroutine,
optionally recover/panic-handle and/or swap to a bounded worker or channel if
you must limit concurrency) to avoid holding up the loop.

---

Nitpick comments:
In `@client/internal/peer/conn_test.go`:
- Around line 374-384: The test currently asserts on the error string from
c.AttachICE() by checking for "workerICE", which is brittle; change the
implementation of AttachICE to return or wrap a sentinel error (e.g.,
ErrNilWorkerICE) and update the test to use errors.Is(err, ErrNilWorkerICE)
instead of substring matching; if adding a sentinel isn't possible right now, at
minimum replace the substring check with errors.Is against an existing exported
sentinel or add a clear comment documenting the coupling. Ensure AttachICE (and
any callers) wrap the sentinel with fmt.Errorf("%w: ...", ErrNilWorkerICE) so
errors.Is will work reliably.

In `@client/internal/peer/status.go`:
- Around line 367-382: UpdatePeerIceBackoff currently silently no-ops when the
peer key is missing, diverging from other update methods like UpdatePeerState /
UpdateWireGuardPeerState / UpdatePeerFQDN which return an error; change
UpdatePeerIceBackoff(pubKey string, snap BackoffSnapshot) to return an error,
lock/unlock as before, and when peer not found return a descriptive error (e.g.,
fmt.Errorf("peer %s not found", pubKey)); update callers to handle the returned
error, or if you intentionally want the silent no-op instead, add a short
comment above UpdatePeerIceBackoff explaining why ignoring missing peers is
acceptable to avoid surprise.

In `@management/server/types/settings_test.go`:
- Around line 5-20: Add a new unit test mirroring
TestSettings_Copy_P2pRetryMaxSeconds to validate deep-copy behavior for
Settings.ConnectionMode: create a source Settings with ConnectionMode set to a
string pointer, call Settings.Copy(), assert dst.ConnectionMode is non-nil and
equals the original value, then mutate *dst.ConnectionMode and assert
*src.ConnectionMode remains unchanged to verify cloneStringPtr produced a deep
copy; reference the Settings struct, its Copy method, and cloneStringPtr in the
test.

In `@shared/connectionmode/mode_test.go`:
- Around line 77-94: The test TestToLazyConnectionEnabled omits ModeFollowServer
so its ToLazyConnectionEnabled() behavior isn't asserted; add a test case entry
in the cases table for ModeFollowServer with the expected boolean (likely false,
matching ModeUnspecified) so the call to Mode.ToLazyConnectionEnabled() for
ModeFollowServer is validated like the other entries.

In `@shared/management/http/api/openapi.yml`:
- Around line 357-369: Mark the legacy boolean field lazy_connection_enabled as
deprecated and make precedence explicit: add deprecated: true to the
lazy_connection_enabled schema and update its description to indicate it is
legacy and overridden by connection_mode when connection_mode is non-null;
update the connection_mode description to explicitly state that a non-null
connection_mode takes precedence over lazy_connection_enabled and that NULL
means "fall back to lazy_connection_enabled" for backwards compatibility (use
the existing keys lazy_connection_enabled and connection_mode to locate the
schema).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f6253fba-a00e-49ab-b17b-28539d6a1fa2

📥 Commits

Reviewing files that changed from the base of the PR and between b19b746 and 2b6f78b.

⛔ Files ignored due to path filters (4)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • client/proto/daemon_grpc.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/proxy_service.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (46)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/root.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/cmd/up.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/connect.go
  • client/internal/engine.go
  • client/internal/lazyconn/env.go
  • client/internal/lazyconn/inactivity/manager.go
  • client/internal/lazyconn/inactivity/manager_test.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/conn_test.go
  • client/internal/peer/env.go
  • client/internal/peer/env_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/handshaker_test.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/status.go
  • client/internal/peer/worker_ice.go
  • client/internal/profilemanager/config.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/status/status.go
  • client/status/status_test.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
  • management/internals/shared/grpc/conversion.go
  • management/internals/shared/grpc/conversion_test.go
  • management/server/account.go
  • management/server/activity/codes.go
  • management/server/http/handlers/accounts/accounts_handler.go
  • management/server/http/handlers/accounts/accounts_handler_test.go
  • management/server/types/settings.go
  • management/server/types/settings_test.go
  • shared/connectionmode/mode.go
  • shared/connectionmode/mode_test.go
  • shared/management/http/api/openapi.yml
  • shared/management/http/api/types.gen.go
  • shared/management/proto/management.proto
💤 Files with no reviewable changes (2)
  • client/ui/event_handler.go
  • client/ui/const.go

Comment on lines +347 to +352
// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative timeout inputs before casting to uint32.

These setters currently turn negative Android values into huge positive timeouts (-1 becomes 4294967295), which silently corrupts the persisted override instead of clearing/rejecting it. Clamp or reject < 0 before the cast for all three setters.

🛠️ Suggested fix
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 368-373, 388-393

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 347 - 352, Validate negative
timeout inputs before casting to uint32: in SetRelayTimeoutSeconds (and the
other two timeout setters around the same area) check if the incoming secs < 0
and, if so, clear the override (set the corresponding
p.configInput.*TimeoutSeconds = nil) instead of casting; otherwise cast to
uint32 and assign as before. Ensure you update all three setters (the
SetRelayTimeoutSeconds function and the two other timeout setter functions
referenced) with the same guard.

Comment thread client/cmd/up.go
Comment on lines +442 to +453
if cmd.Flag(connectionModeFlag).Changed {
	req.ConnectionMode = &connectionMode
}
if cmd.Flag(relayTimeoutFlag).Changed {
	req.RelayTimeoutSeconds = &relayTimeoutSecs
}
if cmd.Flag(p2pTimeoutFlag).Changed {
	req.P2PTimeoutSeconds = &p2pTimeoutSecs
}
if cmd.Flag(p2pRetryMaxFlag).Changed {
	req.P2PRetryMaxSeconds = &p2pRetryMaxSecs
}
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

New connection-mode flags are silently dropped in daemon mode.

setupSetConfigReq populates req.ConnectionMode, req.RelayTimeoutSeconds, req.P2PTimeoutSeconds, and req.P2PRetryMaxSeconds, but server.go's SetConfig handler never reads or persists these four fields from the incoming *proto.SetConfigRequest. In daemon mode the client.SetConfig call at runInDaemonMode succeeds with no error, but the values are silently discarded — the user's --connection-mode override never takes effect.

The same applies to setupLoginRequest (lines 699-710): the populated fields are carried over the wire but server.go's Login handler does not persist them either.

The fix is to extract and apply these fields inside server.go's SetConfig:

🐛 Proposed fix in server.go `SetConfig`
 	if msg.Mtu != nil {
 		mtu := uint16(*msg.Mtu)
 		config.MTU = &mtu
 	}
+
+	if msg.ConnectionMode != nil {
+		config.ConnectionMode = *msg.ConnectionMode
+	}
+	if msg.RelayTimeoutSeconds != nil {
+		config.RelayTimeoutSeconds = *msg.RelayTimeoutSeconds
+	}
+	if msg.P2PTimeoutSeconds != nil {
+		config.P2pTimeoutSeconds = *msg.P2PTimeoutSeconds
+	}
+	if msg.P2PRetryMaxSeconds != nil {
+		config.P2pRetryMaxSeconds = *msg.P2PRetryMaxSeconds
+	}

 	if _, err := profilemanager.UpdateConfig(config); err != nil {

This is actually low effort: the fix is four simple nil-checks in one location.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/up.go` around lines 442 - 453, The SetConfig handler in server.go
is ignoring the incoming optional fields (req.ConnectionMode,
req.RelayTimeoutSeconds, req.P2PTimeoutSeconds, req.P2PRetryMaxSeconds) so
daemon-mode client overrides are dropped; add four nil-checks in server.go's
SetConfig to apply each field when non-nil (e.g., if req.ConnectionMode != nil {
serverConfig.ConnectionMode = *req.ConnectionMode }) and persist/update the
server configuration/store accordingly; do the same for Login's handler to read
and persist the same four fields from the incoming request (setupSetConfigReq /
setupLoginRequest populate them) so the client flags are honored.

Comment thread client/internal/conn_mgr.go
Comment on lines +277 to +320
newMode, newRelay, newP2P, newP2pRetry := resolveConnectionMode(
	e.envMode, e.envRelayTimeout, e.cfgMode, e.cfgRelayTimeout,
	e.cfgP2pTimeout, e.cfgP2pRetryMax, pc,
)

if newMode == e.mode && newRelay == e.relayTimeoutSecs &&
	newP2P == e.p2pTimeoutSecs && newP2pRetry == e.p2pRetryMaxSecs {
	return nil
}
prev := e.mode
e.mode = newMode
e.relayTimeoutSecs = newRelay
e.p2pTimeoutSecs = newP2P
e.p2pRetryMaxSecs = newP2pRetry
e.propagateP2pRetryMaxToConns()

wasManaged := modeUsesLazyMgr(prev)
isManaged := modeUsesLazyMgr(newMode)
modeChanged := prev != newMode

if modeChanged && wasManaged && !isManaged {
	log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
	return nil
}

if modeChanged && wasManaged && isManaged {
	// Switching between lazy and dynamic at runtime: tear down the
	// existing manager so initLazyManager picks up the new timeouts.
	log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
	e.closeManager(ctx)
	e.statusRecorder.UpdateLazyConnection(false)
}

if isManaged && e.lazyConnMgr == nil {
	if e.rosenpassEnabled {
		log.Warnf("rosenpass enabled, ignoring lazy/dynamic mode push")
		return nil
	}
	log.Infof("lazy/dynamic connection manager enabled by management push (mode=%s)", newMode)
	e.initLazyManager(ctx)
	e.startModeSideEffects()
	return e.addPeersToLazyConnManager()
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restart the lazy manager when timeouts change inside the same managed mode.

relayTimeoutSecs and p2pTimeoutSecs are only consumed in initLazyManager(). Right now a server push from, say, p2p-dynamic(60/30) to p2p-dynamic(10/5) updates the fields on ConnMgr but keeps the existing manager instance alive, so the inactivity thresholds never actually change until a full restart or a mode flip.

Suggested fix
-	if modeChanged && wasManaged && isManaged {
+	timeoutsChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
+	if wasManaged && isManaged && (modeChanged || timeoutsChanged) {
 		// Switching between lazy and dynamic at runtime: tear down the
 		// existing manager so initLazyManager picks up the new timeouts.
-		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
+		log.Infof("restarting lazy/dynamic manager to apply updated mode/timeouts (%s -> %s)", prev, newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}

Capture timeoutsChanged before overwriting the old timeout fields so the comparison is against the previous manager config.

Also applies to: 553-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 277 - 320, The code updates
ConnMgr.relayTimeoutSecs, p2pTimeoutSecs and p2pRetryMaxSecs before deciding
whether to restart the existing lazy manager, so changes to timeouts inside the
same managed mode are ignored; fix by capturing whether the timeout values
actually changed (e.g. timeoutsChanged := newRelay != e.relayTimeoutSecs ||
newP2P != e.p2pTimeoutSecs || newP2pRetry != e.p2pRetryMaxSecs) before
overwriting the fields, then after setting e.mode and the timeout fields, if
isManaged && e.lazyConnMgr != nil && timeoutsChanged call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), and then re-init via
e.initLazyManager(ctx) followed by e.startModeSideEffects() and
e.addPeersToLazyConnManager(); keep existing logic for mode flips using
modeUsesLazyMgr, and ensure propagateP2pRetryMaxToConns() is still called after
updating the retry field.

Comment on lines +76 to +87
// NewManagerWithTwoTimers is the Phase-2 constructor. Pass 0 for either
// timeout to disable that teardown path. Both 0 leaves the manager
// running but inert (no channel ever fires) -- used by p2p / relay-forced
// modes that don't tear down workers.
func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
	if iceTimeout > 0 {
		log.Infof("ICE inactivity timeout: %v", iceTimeout)
	}
	if relayTimeout > 0 {
		log.Infof("relay inactivity timeout: %v", relayTimeout)
	}
	return newManager(iface, iceTimeout, relayTimeout)
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate two-timer thresholds the same way the legacy constructor does.

NewManagerWithTwoTimers bypasses MinimumInactivityThreshold, so negative or sub-minute values are accepted on the new path. With a 1-minute polling loop, that makes small values misleading and negative values effectively mark every peer idle on the first tick.

🛠️ Suggested fix
 func NewManagerWithTwoTimers(iface WgInterface, iceTimeout, relayTimeout time.Duration) *Manager {
+	validateOptional := func(name string, v time.Duration) time.Duration {
+		if v == 0 {
+			return 0
+		}
+		if v < MinimumInactivityThreshold {
+			log.Warnf("%s inactivity timeout %v is too low, using minimum %v", name, v, MinimumInactivityThreshold)
+			return MinimumInactivityThreshold
+		}
+		return v
+	}
+
+	iceTimeout = validateOptional("ICE", iceTimeout)
+	relayTimeout = validateOptional("relay", relayTimeout)
+
 	if iceTimeout > 0 {
 		log.Infof("ICE inactivity timeout: %v", iceTimeout)
 	}
 	if relayTimeout > 0 {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/inactivity/manager.go` around lines 76 - 87,
NewManagerWithTwoTimers currently accepts negative or sub-minute durations;
validate and clamp iceTimeout and relayTimeout the same as the legacy
constructor by enforcing MinimumInactivityThreshold (and preventing negatives)
before calling newManager. In NewManagerWithTwoTimers, if iceTimeout or
relayTimeout is less than MinimumInactivityThreshold (or negative), set them to
MinimumInactivityThreshold, log the adjusted values for visibility (e.g.,
"adjusted ICE/relay inactivity timeout"), and then pass the sanitized timeouts
into newManager so both constructors behave identically.

Comment thread management/internals/shared/grpc/conversion.go
Comment on lines +115 to +119
if settings.ConnectionMode != nil {
	if m, err := connectionmode.ParseString(*settings.ConnectionMode); err == nil && m != connectionmode.ModeUnspecified {
		resolvedMode = m
	}
}
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Log when Settings.ConnectionMode fails to parse so admin typos are surfaced.

Right now an invalid value (say "p2p-typoed") is silently dropped — the function falls back to the legacy bool with no signal at all. A misconfigured admin will see "old" behavior with no clue why. A one-line warn log on the parse error is cheap and matches what the env-resolver in client/internal/peer/env.go already does for EnvKeyNBConnectionMode.

📝 Suggested change
 	resolvedMode := connectionmode.ResolveLegacyLazyBool(settings.LazyConnectionEnabled)
 	if settings.ConnectionMode != nil {
-		if m, err := connectionmode.ParseString(*settings.ConnectionMode); err == nil && m != connectionmode.ModeUnspecified {
+		m, err := connectionmode.ParseString(*settings.ConnectionMode)
+		if err != nil {
+			log.Warnf("ignoring unknown Settings.ConnectionMode %q: %v", *settings.ConnectionMode, err)
+		} else if m != connectionmode.ModeUnspecified {
 			resolvedMode = m
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/internals/shared/grpc/conversion.go` around lines 115 - 119, When
parsing Settings.ConnectionMode inside the if settings.ConnectionMode != nil
block use connectionmode.ParseString(*settings.ConnectionMode) as before but if
it returns an error emit a one-line warning indicating the bad value so admin
typos are visible; update the code around connectionmode.ParseString to log a
warn (e.g., logger.Warnf or the package logger) with the invalid string and the
parse error when err != nil, otherwise keep the existing behavior of assigning
resolvedMode when m != connectionmode.ModeUnspecified.

Comment on lines 372 to 375
	am.handleRoutingPeerDNSResolutionSettings(ctx, oldSettings, newSettings, userID, accountID)
	am.handleLazyConnectionSettings(ctx, oldSettings, newSettings, userID, accountID)
	am.handleConnectionModeSettings(ctx, oldSettings, newSettings, userID, accountID)
	am.handlePeerLoginExpirationSettings(ctx, oldSettings, newSettings, userID, accountID)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate connection-mode setting changes to peers, not just audit them.

This new audit hook lands, but updateAccountPeers is never flipped when ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, or P2pRetryMaxSeconds change. As written, the account setting is saved and audited, but already-connected clients won't be pushed the new runtime config until some unrelated peer update happens.

🛠️ Suggested fix
 		if oldSettings.RoutingPeerDNSResolutionEnabled != newSettings.RoutingPeerDNSResolutionEnabled ||
 			oldSettings.LazyConnectionEnabled != newSettings.LazyConnectionEnabled ||
+			!equalStringPtr(oldSettings.ConnectionMode, newSettings.ConnectionMode) ||
+			!equalUint32Ptr(oldSettings.RelayTimeoutSeconds, newSettings.RelayTimeoutSeconds) ||
+			!equalUint32Ptr(oldSettings.P2pTimeoutSeconds, newSettings.P2pTimeoutSeconds) ||
+			!equalUint32Ptr(oldSettings.P2pRetryMaxSeconds, newSettings.P2pRetryMaxSeconds) ||
 			oldSettings.DNSDomain != newSettings.DNSDomain ||
 			oldSettings.AutoUpdateVersion != newSettings.AutoUpdateVersion ||
 			oldSettings.AutoUpdateAlways != newSettings.AutoUpdateAlways {
 			updateAccountPeers = true
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/account.go` around lines 372 - 375, The change handlers
currently only audit connection-mode related updates but do not push new runtime
config to live peers; update the caller to trigger am.updateAccountPeers when
any of ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds, or
P2pRetryMaxSeconds change. Concretely: modify handleConnectionModeSettings (or
have the handlers return a bool) so the outer code that calls
am.handleConnectionModeSettings(...) can detect a change and then call
am.updateAccountPeers(ctx, userID, accountID) (or the existing
updateAccountPeers signature) after the settings are saved/audited; ensure the
check compares oldSettings.ConnectionMode, oldSettings.RelayTimeoutSeconds,
oldSettings.P2pTimeoutSeconds, and oldSettings.P2pRetryMaxSeconds against the
newSettings versions and only calls updateAccountPeers when any differ.

Comment on lines +229 to +240
	if req.Settings.P2pTimeoutSeconds != nil {
		v := uint32(*req.Settings.P2pTimeoutSeconds)
		returnSettings.P2pTimeoutSeconds = &v
	}
	if req.Settings.P2pRetryMaxSeconds != nil {
		v := uint32(*req.Settings.P2pRetryMaxSeconds)
		returnSettings.P2pRetryMaxSeconds = &v
	}
	if req.Settings.RelayTimeoutSeconds != nil {
		v := uint32(*req.Settings.RelayTimeoutSeconds)
		returnSettings.RelayTimeoutSeconds = &v
	}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject negative timeout values before converting them to uint32.

These casts accept negative JSON values and wrap them into huge unsigned numbers, so a request like -1 becomes 4294967295 seconds instead of a validation error. Please validate all three timeout fields as >= 0 before the cast.

🛠️ Suggested fix
 	if req.Settings.P2pTimeoutSeconds != nil {
+		if *req.Settings.P2pTimeoutSeconds < 0 {
+			return nil, status.Errorf(status.InvalidArgument, "p2p_timeout_seconds must be >= 0")
+		}
 		v := uint32(*req.Settings.P2pTimeoutSeconds)
 		returnSettings.P2pTimeoutSeconds = &v
 	}
 	if req.Settings.P2pRetryMaxSeconds != nil {
+		if *req.Settings.P2pRetryMaxSeconds < 0 {
+			return nil, status.Errorf(status.InvalidArgument, "p2p_retry_max_seconds must be >= 0")
+		}
 		v := uint32(*req.Settings.P2pRetryMaxSeconds)
 		returnSettings.P2pRetryMaxSeconds = &v
 	}
 	if req.Settings.RelayTimeoutSeconds != nil {
+		if *req.Settings.RelayTimeoutSeconds < 0 {
+			return nil, status.Errorf(status.InvalidArgument, "relay_timeout_seconds must be >= 0")
+		}
 		v := uint32(*req.Settings.RelayTimeoutSeconds)
 		returnSettings.RelayTimeoutSeconds = &v
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@management/server/http/handlers/accounts/accounts_handler.go` around lines
229 - 240, The three timeout fields (req.Settings.P2pTimeoutSeconds,
P2pRetryMaxSeconds, RelayTimeoutSeconds) are being cast directly to uint32 which
allows negative JSON numbers to wrap to large unsigned values; update the
validation before casting in the handler that populates returnSettings: check
each pointer is non-nil and *value >= 0, and if a negative value is detected
return a validation/error response (or handle it per existing validation flow)
instead of casting; then safely convert to uint32 (e.g., v :=
uint32(*req.Settings.P2pTimeoutSeconds)) and assign to
returnSettings.P2pTimeoutSeconds, P2pRetryMaxSeconds, and RelayTimeoutSeconds
respectively.

Comment on lines +367 to +372
description: |
Account-wide default peer-connection mode. NULL means
"fall back to lazy_connection_enabled" for backwards compatibility.
Phase 1 of issue #5989: relay-forced, p2p, and p2p-lazy are
functional. p2p-dynamic is reserved (passes through as p2p in
Phase 1; will become functional in Phase 2).

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale connection_mode phase semantics in the description.

This text says p2p-dynamic is reserved/pass-through, which conflicts with the current phase scope and can mislead API consumers.

✏️ Suggested doc update
         connection_mode:
           x-experimental: true
           type: string
           enum: [relay-forced, p2p, p2p-lazy, p2p-dynamic]
           nullable: true
           description: |
             Account-wide default peer-connection mode. NULL means
             "fall back to lazy_connection_enabled" for backwards compatibility.
-            Phase 1 of issue `#5989`: relay-forced, p2p, and p2p-lazy are
-            functional. p2p-dynamic is reserved (passes through as p2p in
-            Phase 1; will become functional in Phase 2).
+            relay-forced, p2p, p2p-lazy, and p2p-dynamic are supported.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@shared/management/http/api/openapi.yml` around lines 367 - 372, The
description for the "connection_mode" enum is outdated and still states that
"p2p-dynamic" is reserved/passes through; update the OpenAPI description for the
connection_mode field so it reflects the current phase behavior (remove the
claim that p2p-dynamic is reserved/pass-through) and clearly document which
modes are currently functional (relay-forced, p2p, p2p-lazy) and how NULL
interacts with lazy_connection_enabled for backwards compatibility; edit the
description text near the connection_mode property in
shared/management/http/api/openapi.yml and mention the accurate phase semantics
for p2p-dynamic (omit or mark as not supported rather than pass-through) and
ensure references to lazy_connection_enabled and Phase 1/Phase 2 are consistent.

@MichaelUray MichaelUray changed the title [2/4 stacked] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab [client] Phase 3.5 + 3.7d-h of #5989: network-change ICE recovery + GUI mode tab (stack 2/4) May 6, 2026
@MichaelUray MichaelUray force-pushed the pr/b-phase3.5-network-change branch from 2b6f78b to 6bf6777 Compare May 6, 2026 06:21
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
client/internal/peer/worker_ice.go (1)

218-231: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

lastKnownState is still accessed without consistent synchronization — data race.

IsConnected() reads lastKnownState while holding muxAgent, but the writes inside onConnectionStateChange at lines 550 and 562 happen on the pion callback goroutine without acquiring muxAgent. This is a Go data race, and the new Conn.onNetworkChange path (in conn.go at line 1244) now reads through IsConnected() to decide whether to close the agent — a torn read can either skip Close() on a freshly-disconnected agent or close a still-Connected one, exactly the scenario the comment in onNetworkChange says to avoid.

Switch lastKnownState to atomic.Int32 (or similar) so reads and writes are consistent regardless of muxAgent:

-	// we record the last known state of the ICE agent to avoid duplicate on disconnected events
-	lastKnownState ice.ConnectionState
+	// Accessed from pion's state-change callback and from IsConnected()
+	// concurrently — keep atomic so reads on the network-change path are
+	// not torn against unlocked writes in onConnectionStateChange.
+	lastKnownState atomic.Int32 // ice.ConnectionState
-		lastKnownState:    ice.ConnectionStateDisconnected,
+	}
+	w.lastKnownState.Store(int32(ice.ConnectionStateDisconnected))
 func (w *WorkerICE) IsConnected() bool {
 	w.muxAgent.Lock()
 	defer w.muxAgent.Unlock()
-	return w.agent != nil && w.lastKnownState == ice.ConnectionStateConnected
+	return w.agent != nil && ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected
 }
 		case ice.ConnectionStateConnected:
-			w.lastKnownState = ice.ConnectionStateConnected
+			w.lastKnownState.Store(int32(ice.ConnectionStateConnected))
 			w.logSuccessfulPaths(agent)
@@
-			if w.lastKnownState == ice.ConnectionStateConnected {
-				w.lastKnownState = ice.ConnectionStateDisconnected
+			if ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected {
+				w.lastKnownState.Store(int32(ice.ConnectionStateDisconnected))
 				w.conn.onICEStateDisconnected(sessionChanged)
 			}

Also applies to: 549-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 231, IsConnected reads
lastKnownState under muxAgent but onConnectionStateChange writes it without
locking, causing a data race; change the WorkerICE.lastKnownState from a plain
int to an atomic.Int32 (or similar), update IsConnected() to use
lastKnownState.Load() (while keeping muxAgent for agent nil-check), and update
onConnectionStateChange to use lastKnownState.Store(newState) instead of
assigning directly; ensure all other accesses (e.g., in the code paths around
lines referenced) use atomic Load/Store and remove any unsynchronized direct
reads/writes to lastKnownState while leaving agent access protected by muxAgent
as before.
client/internal/peer/conn.go (1)

202-212: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Disable-backoff state is still not preserved across Open/Close cycles.

The ConnConfig comment at lines 98-100 says ^uint32(0) is the disable sentinel and the resolver translates it to 0 before reaching here, while the Open() block treats 0 as "use built-in default" (DefaultP2PRetryMax). These two semantics conflict: a peer that the resolver wanted disabled ends up with the default 15-minute backoff cap. Likewise, SetIceBackoffMax(0) writes 0 into config.P2pRetryMaxSeconds, and a subsequent Close()/Open() cycle silently re-promotes the peer from "disabled" to the default schedule.

Use a distinct in-process representation for "disabled" so the three states (use default, disabled, explicit value) can each round-trip. For example, treat the wire ^uint32(0) as the in-process "disabled" sentinel rather than collapsing it to 0 at the resolver, so Open() can branch on it directly.

+const p2pRetryMaxDisabled = ^uint32(0)
+
-	backoffCap := time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
-	if backoffCap == 0 {
-		backoffCap = DefaultP2PRetryMax
-	}
+	var backoffCap time.Duration
+	switch conn.config.P2pRetryMaxSeconds {
+	case 0:
+		backoffCap = DefaultP2PRetryMax
+	case p2pRetryMaxDisabled:
+		backoffCap = 0
+	default:
+		backoffCap = time.Duration(conn.config.P2pRetryMaxSeconds) * time.Second
+	}
@@
 func (conn *Conn) SetIceBackoffMax(d time.Duration) {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
-	conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	if d == 0 {
+		conn.config.P2pRetryMaxSeconds = p2pRetryMaxDisabled
+	} else {
+		conn.config.P2pRetryMaxSeconds = uint32(d / time.Second)
+	}
 	if conn.iceBackoff != nil {
 		conn.iceBackoff.SetMaxBackoff(d)
 	}
 }

The companion change in resolver/engine.go must also stop translating ^uint32(0) to 0 and instead pass the sentinel through verbatim.

Also applies to: 1152-1163

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 202 - 212, The code collapses the
wire sentinel for "disabled" into 0 which conflicts with Open() treating 0 as
"use default", causing disabled peers to be re-enabled; fix by introducing a
distinct in-process sentinel for "disabled" (e.g. a named constant like
DisabledP2pRetrySentinel) and preserve the resolver wire sentinel instead of
translating it to 0 in resolver/engine.go, then update Open()'s logic around
conn.config.P2pRetryMaxSeconds to: if it equals the disabled sentinel, configure
conn.iceBackoff as disabled (or leave nil) so backoff is turned off; else if it
equals 0 treat it as "use default" and set backoffCap = DefaultP2PRetryMax,
otherwise use the explicit value to newIceBackoff / SetMaxBackoff; also ensure
SetIceBackoffMax writes the same in-process sentinel when disabling so
Close()/Open() round-trips.
client/android/preferences.go (1)

349-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Negative secs is silently aliased to a huge uint32; for SetP2pRetryMaxSeconds it collides with the disable sentinel.

A SetRelayTimeoutSeconds(-1) becomes 4294967295 seconds (~136 years), and the same bad cast on SetP2pRetryMaxSeconds(-1) yields exactly ^uint32(0), which is the documented wire-format sentinel for "user-explicit disable backoff" (see client/internal/peer/conn.go lines 98-100). Negative input therefore silently flips backoff into the disabled state without any validation. Clamp or reject < 0 in all three setters.

 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs < 0 {
+		secs = 0
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 370-373, 390-393

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 349 - 352, SetRelayTimeoutSeconds
(and the two other setters that cast int64 to uint32, e.g.,
SetP2pRetryMaxSeconds and the comparable setter) currently cast negative secs
into a huge uint32 (which can collide with the disable sentinel); update each
setter to validate secs before casting: if secs < 0 then clamp to 0 (or return
an error if you prefer API change) and only then assign via uint32(secs) into
the pointer field so negative inputs cannot become ^uint32(0) or other
unintended values.
🧹 Nitpick comments (6)
client/internal/peer/guard/guard.go (1)

57-62: ⚡ Quick win

SetOnNetworkChange after Start() is an undetected data race

The comment says "Set once before Start() is called; no lock needed." This is safe only if the goroutine spawning Start() happens strictly after SetOnNetworkChange returns, establishing a happens-before edge. However, nothing in the API enforces this — a caller that invokes SetOnNetworkChange after spawning go g.Start(...) introduces a data race on g.onNetworkChange with no diagnostic. Consider using sync/atomic to store the function pointer, or document this precondition with a //nolint:gocritic + go vet note so it surfaces in race-detector runs.

As a lightweight alternative, a sync/atomic.Pointer[func()] store/load pair makes the intent explicit and race-detector-clean with minimal overhead.
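A rough sketch of that alternative, assuming Go 1.19+ typed atomics and a Guard struct shaped like the one described above (field and helper names here are illustrative, not the current code):

package guard

import "sync/atomic"

// Guard sketch: only the callback-related field is shown.
type Guard struct {
	// onNetworkChange is stored atomically so SetOnNetworkChange cannot race
	// with a goroutine that is already running Start().
	onNetworkChange atomic.Pointer[func()]
}

// SetOnNetworkChange registers the callback; safe before or after Start().
func (g *Guard) SetOnNetworkChange(cb func()) {
	g.onNetworkChange.Store(&cb)
}

// fireNetworkChange is how the guard loop would invoke the callback.
func (g *Guard) fireNetworkChange() {
	if cb := g.onNetworkChange.Load(); cb != nil && *cb != nil {
		(*cb)()
	}
}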

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 57 - 62, SetOnNetworkChange
currently writes g.onNetworkChange without synchronization which can race with
concurrent Start() readers; change the Guard implementation to store the
callback via a sync/atomic pointer type (e.g. sync/atomic.Pointer[func()])
instead of the plain field and update all reads (e.g. in Start and any other
goroutines that invoke the callback) to load via atomic.Load/Pointer (or the
generic Atomic.Pointer API) before invoking; ensure SetOnNetworkChange uses
atomic.Store/Pointer to set the callback and nil-checks use the atomic load so
access is race-detector-clean and retains the same semantics.
client/internal/peer/ice_backoff.go (2)

82-92: 💤 Low value

Optional: clear suspended flag on lazy expiry to keep state self-consistent.

IsSuspended() returns false once nextRetry is in the past, but it leaves s.suspended = true. Any future caller that inspects the raw flag (or that adds non-time-based logic) will see a stale "suspended" state. The condition is duplicated in Snapshot() (line 131) and the two helpers can drift. Consider clearing the flag once it has lapsed (the read-only API then performs a write, which is still safe under s.mu), or simply expose only the time-based predicate.

 func (s *iceBackoffState) IsSuspended() bool {
 	s.mu.Lock()
 	defer s.mu.Unlock()
 	if !s.suspended {
 		return false
 	}
-	if time.Now().After(s.nextRetry) {
-		return false
-	}
-	return true
+	if time.Now().After(s.nextRetry) {
+		s.suspended = false
+		return false
+	}
+	return true
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 82 - 92, IsSuspended()
currently returns false when nextRetry has passed but leaves s.suspended true,
causing inconsistent state with Snapshot(); update IsSuspended() (in
iceBackoffState) to, under s.mu lock, check if s.suspended &&
time.Now().After(s.nextRetry) and if so clear s.suspended (and optionally reset
nextRetry) before returning false so the in-memory flags remain self-consistent;
ensure you modify only inside the locked section to avoid races and keep
Snapshot() behavior aligned.

161-169: 💤 Low value

Confirm: SetMaxBackoff resets the in-flight exponential schedule.

The doc comment states the failure counter is preserved, which it is, but buildBackoff calls bo.Reset(), so the schedule restarts at InitialInterval. A peer whose suspension interval has climbed to, say, ~8 minutes will, after a server-pushed SetMaxBackoff to a different cap, drop back to ~1 minute on the next failure. This is most likely intentional (new policy from server ⇒ fresh schedule), but the comment "preserves the failure counter" can be misread as "preserves the schedule"; consider clarifying.

-// SetMaxBackoff updates the cap. Called from ConnMgr.UpdatedRemotePeerConfig
-// when the server pushes a new value. Rebuilds the internal backoff with
-// the new schedule but preserves the failure counter.
+// SetMaxBackoff updates the cap. Called from ConnMgr.UpdatedRemotePeerConfig
+// when the server pushes a new value. Rebuilds the internal exponential
+// schedule from scratch (so the next markFailure starts at InitialInterval)
+// but preserves the cumulative failure counter for status reporting.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 161 - 169, Update the
comment for SetMaxBackoff on iceBackoffState to clarify that although the
failure counter is preserved, the exponential backoff schedule is reset because
SetMaxBackoff calls buildBackoff(d) which invokes bo.Reset(); explicitly state
that the schedule (current interval) restarts at InitialInterval while the
failure count remains unchanged so callers won't be misled into thinking the
in-flight schedule is preserved.
client/internal/peer/conn.go (2)

1030-1056: 💤 Low value

Minor: redundant IsSuspended() check.

AttachICE checks iceBackoff.IsSuspended() (line 1034) and then calls attachICEListenerLocked, which performs the same check (line 1070). The first check returns early without an error; the second returns false (no-op). The behaviour is identical, so the first block is just an extra log path that fires before the listener attach. If the dual logging is intentional (different message), keep it; otherwise drop the outer check and rely on attachICEListenerLocked returning false to skip the SendOffer.

 func (conn *Conn) AttachICE() error {
 	conn.mu.Lock()
 	defer conn.mu.Unlock()
 
-	if conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() {
-		snap := conn.iceBackoff.Snapshot()
-		conn.Log.Debugf("ICE backoff active (failure #%d, retry at %s), staying on relay",
-			snap.Failures,
-			snap.NextRetry.Format("15:04:05"))
-		return nil
-	}
 	if conn.handshaker == nil {
 		return fmt.Errorf("AttachICE: handshaker not initialized (Open not called)")
 	}
 	if conn.workerICE == nil {
 		return fmt.Errorf("AttachICE: workerICE is nil (relay-forced mode)")
 	}
 
 	if !conn.attachICEListenerLocked() {
 		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 1030 - 1056, Remove the redundant
iceBackoff.IsSuspended() check and its logging from Conn.AttachICE and rely on
attachICEListenerLocked to handle suspended backoff (so AttachICE no longer
early-returns for suspension); specifically, delete the block that checks
conn.iceBackoff != nil && conn.iceBackoff.IsSuspended() and its
conn.Log.Debugf/return, leaving the rest of AttachICE (including the handshaker
nil checks and the call to attachICEListenerLocked and SendOffer) intact so
behavior is unchanged except for avoiding duplicate suspension logging.

19-20: 💤 Low value

Import group ordering.

shared/connectionmode is wedged between client/internal/metrics and client/internal/peer/conntype, breaking the client/... group's alphabetical ordering. goimports would put shared/... after the client/... group. Minor, but easy to fix.

 	"github.com/netbirdio/netbird/client/internal/metrics"
-	"github.com/netbirdio/netbird/shared/connectionmode"
 	"github.com/netbirdio/netbird/client/internal/peer/conntype"
 	"github.com/netbirdio/netbird/client/internal/peer/dispatcher"
 	"github.com/netbirdio/netbird/client/internal/peer/guard"
 	icemaker "github.com/netbirdio/netbird/client/internal/peer/ice"
 	"github.com/netbirdio/netbird/client/internal/peer/id"
 	"github.com/netbirdio/netbird/client/internal/peer/worker"
 	"github.com/netbirdio/netbird/client/internal/portforward"
 	"github.com/netbirdio/netbird/client/internal/stdnet"
 	"github.com/netbirdio/netbird/route"
 	relayClient "github.com/netbirdio/netbird/shared/relay/client"
+	"github.com/netbirdio/netbird/shared/connectionmode"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 19 - 20, The import block in
client/internal/peer/conn.go has shared/connectionmode placed between
client/internal packages, breaking the client/... group's alphabetical grouping;
reorder the imports so all client/internal/* imports (e.g.,
client/internal/metrics and client/internal/peer/conntype) are contiguous and
alphabetically ordered, and move
github.com/netbirdio/netbird/shared/connectionmode after the client/... group
(matching goimports grouping).
client/internal/peer/env_test.go (1)

33-34: 💤 Low value

Use constants instead of hardcoded environment variable names.

Lines 31–32 already reference EnvKeyNBConnectionMode and EnvKeyNBForceRelay. Lines 33–34 hardcode "NB_ENABLE_EXPERIMENTAL_LAZY_CONN" and "NB_LAZY_CONN_INACTIVITY_THRESHOLD" as string literals.

The peer package defines these as private constants envEnableLazyConn and envInactivityThreshold (accessible within the same package); alternatively, EnvEnableLazyConn and EnvInactivityThreshold can be imported from client/internal/lazyconn. Using constants ensures any rename is caught at compile time.
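For illustration, the two calls would then read roughly as follows (the values passed are placeholders, not the test's actual ones):

t.Setenv(envEnableLazyConn, "true")
t.Setenv(envInactivityThreshold, "5m")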

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/env_test.go` around lines 33 - 34, Replace the hardcoded
environment variable names in the test t.Setenv calls with the package
constants: use envEnableLazyConn instead of "NB_ENABLE_EXPERIMENTAL_LAZY_CONN"
and envInactivityThreshold instead of "NB_LAZY_CONN_INACTIVITY_THRESHOLD" (these
constants are defined in the peer package); update the two t.Setenv calls in
env_test.go to reference those constants so renames are caught at compile time.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d4622b57-deca-4419-b794-100392cfb3c4

📥 Commits

Reviewing files that changed from the base of the PR and between 2b6f78b and 6bf6777.

⛔ Files ignored due to path filters (2)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
  • shared/management/proto/management.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (20)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/engine.go
  • client/internal/peer/conn.go
  • client/internal/peer/env_test.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/worker_ice.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/server/setconfig_test.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
💤 Files with no reviewable changes (2)
  • client/ui/event_handler.go
  • client/ui/const.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/conn_mgr.go

Comment on lines +325 to +352
// SetConnectionMode stores a local override for the connection mode.
// Pass an empty string to clear the override (revert to following the
// server-pushed value).
func (p *Preferences) SetConnectionMode(mode string) {
	m := mode
	p.configInput.ConnectionMode = &m
}

// GetRelayTimeoutSeconds returns the locally configured relay-worker
// inactivity timeout in seconds, or 0 if no override is set (follow
// server-pushed value, or built-in default if the server has none).
func (p *Preferences) GetRelayTimeoutSeconds() (int64, error) {
	if p.configInput.RelayTimeoutSeconds != nil {
		return int64(*p.configInput.RelayTimeoutSeconds), nil
	}
	cfg, err := profilemanager.ReadConfig(p.configInput.ConfigPath)
	if err != nil {
		return 0, err
	}
	return int64(cfg.RelayTimeoutSeconds), nil
}

// SetRelayTimeoutSeconds stores a local override for the relay timeout.
// Pass 0 to clear the override.
func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
	v := uint32(secs)
	p.configInput.RelayTimeoutSeconds = &v
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm there's no other "clear override" path that nils the pointer in apply() or elsewhere.
rg -nP --type=go -C2 'ConnectionMode|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds' -g '!**/*_test.go' -g 'client/internal/profilemanager/**'

Repository: netbirdio/netbird

Length of output: 3375


🏁 Script executed:

#!/bin/bash
# Check the actual SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, SetP2pRetryMaxSeconds implementations
cat -n client/android/preferences.go | head -400 | tail -100

Repository: netbirdio/netbird

Length of output: 4108


🏁 Script executed:

#!/bin/bash
# Search for all occurrences of these four setter functions to ensure we catch all of them
rg -n 'func.*Set(ConnectionMode|RelayTimeoutSeconds|P2pTimeoutSeconds|P2pRetryMaxSeconds)' --type=go -g '!**/*_test.go'

Repository: netbirdio/netbird

Length of output: 426


🏁 Script executed:

#!/bin/bash
# Check if there are any other code paths that might clear these overrides (e.g., by setting to nil)
rg -n 'ConnectionMode.*=.*nil|RelayTimeoutSeconds.*=.*nil|P2pTimeoutSeconds.*=.*nil|P2pRetryMaxSeconds.*=.*nil' --type=go -g '!**/*_test.go'

Repository: netbirdio/netbird

Length of output: 3062


Fix the "clear override" logic to set pointer to nil, not to zero value.

The documented behaviour for SetConnectionMode, SetRelayTimeoutSeconds, SetP2pTimeoutSeconds, and SetP2pRetryMaxSeconds promises that passing 0 (or empty string) clears the override and reverts to the server-pushed value. However, each setter currently stores a pointer to the zero value, and the apply() function in client/internal/profilemanager/config.go (lines 608–626) checks only != nil before assigning. This overwrites the persisted server-pushed value with 0 or "" instead of clearing the override.

To honour the documented semantics, set the pointer to nil when the sentinel value is passed:

Suggested fix pattern
 func (p *Preferences) SetConnectionMode(mode string) {
+	if mode == "" {
+		p.configInput.ConnectionMode = nil
+		return
+	}
 	m := mode
 	p.configInput.ConnectionMode = &m
 }
@@
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		p.configInput.RelayTimeoutSeconds = nil
+		return
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }

Apply the same pattern to SetP2pTimeoutSeconds and SetP2pRetryMaxSeconds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 325 - 352, The setters store
pointers to zero values instead of clearing overrides, so change
SetConnectionMode, SetRelayTimeoutSeconds (and likewise SetP2pTimeoutSeconds and
SetP2pRetryMaxSeconds) to set the corresponding pointer to nil when the sentinel
is passed (empty string for SetConnectionMode, 0 for the timeout/setter
methods); otherwise allocate and assign a pointer to the non-zero value. Update
p.configInput.ConnectionMode and p.configInput.RelayTimeoutSeconds (and the P2P
fields) accordingly so apply()'s nil check correctly reverts to server-pushed
values.

Comment on lines +144 to +147
			// Phase 3.5 (#5989): notify Conn to reset iceBackoff + recreate workerICE
			if g.onNetworkChange != nil {
				g.onNetworkChange()
			}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous onNetworkChange() call blocks ctx.Done() responsiveness

The callback is invoked on the guard's select goroutine. Per the PR description, it recreates the ICE worker — potentially involving non-trivial setup. Until the callback returns, the guard loop cannot react to ctx.Done() or any other channel (including another srReconnectedChan event). Consider dispatching it in a background goroutine:

⚡ Proposed fix
-            // Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-            if g.onNetworkChange != nil {
-                g.onNetworkChange()
-            }
+            // Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+            // Dispatched asynchronously so the guard loop remains responsive to ctx.Done().
+            if g.onNetworkChange != nil {
+                go g.onNetworkChange()
+            }

If the callback itself serializes on an internal mutex (as ICE worker recreation likely does), concurrent calls from rapid reconnect events are already serialized. If not, consider adding a flag or draining mechanism to coalesce multiple rapid network-change signals.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 144 - 147, The select loop
currently calls g.onNetworkChange() synchronously which blocks responsiveness to
ctx.Done() and other channels; change it to dispatch the callback
asynchronously: when g.onNetworkChange != nil spawn a background goroutine to
call g.onNetworkChange() from the guard's select branch (e.g., in the guard loop
where srReconnectedChan is handled) so the loop can continue processing ctx.Done
and other events. To avoid unbounded concurrent invocations, add a simple
coalescing mechanism (a boolean flag or non-blocking send to a buffered channel)
adjacent to the async dispatch to ensure rapid reconnect events don’t spawn
redundant goroutines unless the previous handler has completed; keep the
flag/chan state local to the guard instance that owns g.onNetworkChange.

Comment thread: client/proto/daemon.proto
Comment on lines +208 to +218
  // Phase 1 (#5989): peer-connection mode and idle timeouts.
  // connection_mode is a string (relay-forced, p2p, p2p-lazy, p2p-dynamic,
  // follow-server, or empty); parsed via connectionmode.ParseString at the
  // daemon side. Empty means "no client-side override, use server value".
  optional string connection_mode = 40;
  optional uint32 p2p_timeout_seconds = 41;
  optional uint32 relay_timeout_seconds = 42;

  // Phase 3 (#5989): cap on the ICE-failure backoff schedule. 0 = use
  // server-pushed value (or built-in default 15 min).
  optional uint32 p2p_retry_max_seconds = 43;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire these new request fields through the daemon handlers.

client/server/server.go still ignores LoginRequest.connection_mode/... and SetConfigRequest.connection_mode/..., so callers can send these overrides and get a success response even though the daemon neither applies nor persists them. That leaves the new RPC surface out of sync with the implementation, especially for the GUI save flow.

Also applies to: 714-724
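For illustration only, the wiring could take roughly this shape. applyConnectionOverrides is a hypothetical helper; the config-input type and field names are inferred from the Android preferences code quoted elsewhere in this review, and the imports (fmt, connectionmode, profilemanager) are omitted. This is a sketch, not the daemon's actual API:

// applyConnectionOverrides copies the optional overrides from a
// Login/SetConfig request into the config input that gets persisted.
// mode is the raw connection_mode string from the request; nil pointers
// mean "field not set in the request".
func applyConnectionOverrides(in *profilemanager.ConfigInput, mode string, p2pTimeout, relayTimeout, retryMax *uint32) error {
	if mode != "" {
		if _, err := connectionmode.ParseString(mode); err != nil {
			return fmt.Errorf("invalid connection_mode %q: %w", mode, err)
		}
		in.ConnectionMode = &mode
	}
	if p2pTimeout != nil {
		in.P2pTimeoutSeconds = p2pTimeout
	}
	if relayTimeout != nil {
		in.RelayTimeoutSeconds = relayTimeout
	}
	if retryMax != nil {
		in.P2pRetryMaxSeconds = retryMax
	}
	return nil
}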

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/proto/daemon.proto` around lines 208 - 218, The daemon handlers in
client/server/server.go currently ignore the new LoginRequest.connection_mode,
p2p_timeout_seconds, relay_timeout_seconds, and p2p_retry_max_seconds (and the
same fields on SetConfigRequest), so RPC callers can send them but the daemon
neither applies nor persists them; update the request handling code paths that
process LoginRequest and SetConfigRequest in server.go to read these fields,
validate/parse connection_mode (using connectionmode.ParseString where
appropriate), apply them to the daemon's runtime config/state, and persist them
to the same storage path used by existing config fields so the GUI save flow and
subsequent calls reflect the overrides. Ensure the same wiring is added for any
helper methods that currently consume LoginRequest or SetConfigRequest (so both
request handling and persistence are consistent).

@MichaelUray MichaelUray force-pushed the pr/b-phase3.5-network-change branch from 6bf6777 to b2b91ef Compare May 6, 2026 06:53
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (7)
client/internal/peer/guard/guard.go (1)

144-147: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Synchronous onNetworkChange() still blocks the guard select loop

The callback is still invoked synchronously inside the srReconnectedChan branch. Until it returns, ctx.Done(), relayedConnDisconnected, iCEConnDisconnected, and any further srReconnectedChan events are all blocked. This was flagged in the previous review and remains unresolved.

🛠 Proposed fix (async dispatch with single-flight coalescing)
+		// networkChangePending prevents redundant concurrent recreations.
+		var networkChangePending atomic.Bool

 		case <-srReconnectedChan:
 			g.log.Debugf("has network changes, reset reconnection ticker")
 			ticker.Stop()
 			ticker = g.newReconnectTicker(ctx)
 			tickerChannel = ticker.C
 			iceState.reset()
-			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE
-			if g.onNetworkChange != nil {
-				g.onNetworkChange()
-			}
+			// Phase 3.5 (`#5989`): notify Conn to reset iceBackoff + recreate workerICE.
+			// Dispatched asynchronously so the guard loop stays responsive to ctx.Done().
+			if g.onNetworkChange != nil && !networkChangePending.Swap(true) {
+				go func() {
+					defer networkChangePending.Store(false)
+					g.onNetworkChange()
+				}()
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/guard/guard.go` around lines 144 - 147, The
srReconnectedChan branch currently calls g.onNetworkChange() synchronously
inside the guard select loop, which blocks other cases; change it to dispatch
the work asynchronously: when handling srReconnectedChan, if g.onNetworkChange
!= nil spawn a goroutine to invoke it instead of calling it inline and use a
single-flight/coalescing mechanism (e.g., a singleflight.Group or an in-struct
boolean guarded by a mutex/atomic) so concurrent srReconnectedChan events only
trigger one inflight onNetworkChange invocation; ensure the async worker
performs the existing reset iceBackoff and recreate workerICE logic and that you
respect ctx cancellation to avoid leaking goroutines.
client/internal/peer/worker_ice.go (1)

218-230: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Synchronize lastKnownState across the callback and network-change paths.

IsConnected() at Lines 227-230 now drives the skip-close branch, but onConnectionStateChange() still writes lastKnownState without the same synchronization at Lines 550 and 562. That leaves a data race on the exact close-or-skip decision this PR added.

#!/bin/bash
rg -n -C2 '\blastKnownState\b' client/internal/peer/worker_ice.go

Also applies to: 545-563

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 230, IsConnected reads
lastKnownState under w.muxAgent while onConnectionStateChange writes
lastKnownState without locking, creating a race; fix by protecting all accesses
to lastKnownState with the same mutex (w.muxAgent) — update
onConnectionStateChange (the callback that sets lastKnownState at/around where
it currently assigns at lines ~545-563) to acquire w.muxAgent.Lock() and defer
w.muxAgent.Unlock() around writes (and any reads), ensuring symmetry with
IsConnected and preventing the close-or-skip race.
client/internal/conn_mgr.go (2)

282-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Apply timeout pushes even when the managed mode itself doesn’t change.

Lines 282-320 only restart the manager on modeChanged, but initLazyManager() consumes relayTimeoutSecs and p2pTimeoutSecs once at Lines 557-562. A push from, for example, p2p-dynamic(60/30) to p2p-dynamic(10/5) updates the fields but leaves the live manager running with the old thresholds until a full restart.

Also applies to: 553-562
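A sketch of the missing branch follows; the field and method names are taken from the AI-agent prompt below and have not been verified against the current file, so treat them as assumptions:

// Timeout-only push while the lazy/dynamic manager keeps running: restart the
// manager so initLazyManager re-reads the new thresholds.
timeoutsChanged := newRelayTimeoutSecs != e.relayTimeoutSecs ||
	newP2pTimeoutSecs != e.p2pTimeoutSecs ||
	newP2pRetryMaxSecs != e.p2pRetryMaxSecs

if !modeChanged && isManaged && e.lazyConnMgr != nil && timeoutsChanged {
	e.closeManager(ctx)
	e.initLazyManager(ctx)
	e.startModeSideEffects()
	e.addPeersToLazyConnManager()
}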

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 282 - 320, The runtime ignores
timeout-only pushes if the mode doesn’t change; detect when
relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs change while a manager is active
and apply them by restarting or updating the manager: in conn_mgr.go where
modeChanged/wasManaged/isManaged are computed (and where lazyConnMgr is
checked), add a branch that if isManaged && e.lazyConnMgr != nil and any of
relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs differ from the manager’s
current thresholds, call e.closeManager(ctx) then e.initLazyManager(ctx),
e.startModeSideEffects(), and e.addPeersToLazyConnManager() (or alternatively
propagate the new values into the running manager); reference modeUsesLazyMgr,
lazyConnMgr, relayTimeoutSecs, p2pTimeoutSecs, p2pRetryMaxSecs,
propagateP2pRetryMaxToConns, initLazyManager, closeManager,
startModeSideEffects, and addPeersToLazyConnManager when making the change.

181-188: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t leave e.mode in lazy/dynamic when Rosenpass prevents the manager from starting.

Lines 185-187 and 313-315 return early, but e.mode is still kept/set to p2p-lazy or p2p-dynamic. New peers then inherit that mode, AddPeerConn() falls back to conn.Open() at Lines 391-395 because no manager exists, and ModeP2PDynamic still defers ICE listener registration in client/internal/peer/conn.go at Lines 239-242. Since ActivatePeer() also returns early with no manager at Lines 449-451, those peers can stay relay-only indefinitely.

Also applies to: 312-320

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 181 - 188, When you bail out from
starting the lazy/dynamic manager (both the modeUsesLazyMgr() check and the
e.rosenpassEnabled check), clear/unset the lazy P2P mode on e.mode so it no
longer advertises "p2p-lazy" or "p2p-dynamic" to new peers; specifically, before
each early return set e.mode to a non-lazy value (i.e. remove the lazy/dynamic
flag or assign the normal/direct P2P mode) so callers like AddPeerConn() and
ActivatePeer() won’t treat peers as manager-backed (avoiding the conn.Open()
fallback and deferred ICE registration in ModeP2PDynamic).
client/internal/peer/conn.go (1)

202-212: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the explicit “disable backoff” state round-trippable.

propagateP2pRetryMaxToConns() converts the wire sentinel to time.Duration(0), but Line 1159 stores that back as plain 0. After a later reopen, Lines 204-207 reinterpret 0 as “use default”, so a user-explicit disable silently turns back into the 15-minute cap.

Also applies to: 1156-1162

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/conn.go` around lines 202 - 212, The explicit "disable
backoff" sentinel is being lost because 0 is being reused for "use default";
introduce and use a distinct sentinel (e.g. const P2PRetryMaxDisabledSeconds =
-1) and update both propagateP2pRetryMaxToConns() (the code around 1156-1162)
and this initialization block in conn.go to interpret -1 as "disabled" (set
backoffCap to 0 or disable the iceBackoff entirely) while treating 0 as "use
built-in default" (map to DefaultP2PRetryMax); also ensure
newIceBackoff()/iceBackoff.SetMaxBackoff accept/handle the disabled sentinel so
the explicit disable round-trips unchanged.
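
One way to keep the disable state round-trippable, sketched with the -1 sentinel the prompt suggests; the constant names are proposals, and DefaultP2PRetryMax is assumed to be a time.Duration:

// Proposed sentinels (illustrative): 0 keeps meaning "use built-in default",
// -1 survives round-trips as an explicit disable.
const (
	p2pRetryMaxUseDefault int64 = 0
	p2pRetryMaxDisabled   int64 = -1
)

func resolveBackoffCap(stored int64) time.Duration {
	switch stored {
	case p2pRetryMaxDisabled:
		return 0 // backoff cap disabled; iceBackoff treats 0 as "no cap"
	case p2pRetryMaxUseDefault:
		return DefaultP2PRetryMax // the built-in 15-minute cap
	default:
		return time.Duration(stored) * time.Second
	}
}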
client/ui/client_ui.go (1)

656-666: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently coerce invalid timeout input to 0.

Invalid/overflow timeout text currently collapses to 0, which can unintentionally clear overrides when saving settings. Parse should return an error and buildSetConfigRequest should surface it.

Proposed fix
-func parseUint32Field(text string) uint32 {
+func parseUint32Field(text string) (uint32, error) {
 	t := strings.TrimSpace(text)
 	if t == "" {
-		return 0
+		return 0, nil
 	}
 	v, err := strconv.ParseUint(t, 10, 32)
 	if err != nil {
-		return 0
+		return 0, fmt.Errorf("must be a whole number between 0 and %d", uint64(^uint32(0)))
 	}
-	return uint32(v)
+	return uint32(v), nil
 }
@@
-	relaySecs := parseUint32Field(s.iRelayTimeout.Text)
-	p2pSecs := parseUint32Field(s.iP2pTimeout.Text)
-	retrySecs := parseUint32Field(s.iP2pRetryMax.Text)
+	relaySecs, err := parseUint32Field(s.iRelayTimeout.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid relay timeout: %w", err)
+	}
+	p2pSecs, err := parseUint32Field(s.iP2pTimeout.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid P2P timeout: %w", err)
+	}
+	retrySecs, err := parseUint32Field(s.iP2pRetryMax.Text)
+	if err != nil {
+		return nil, fmt.Errorf("invalid P2P retry-max: %w", err)
+	}
 	req.RelayTimeoutSeconds = &relaySecs
 	req.P2PTimeoutSeconds = &p2pSecs
 	req.P2PRetryMaxSeconds = &retrySecs

Also applies to: 741-750

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/ui/client_ui.go` around lines 656 - 666, The parseUint32Field function
currently swallows parse/overflow errors and returns 0 which can accidentally
clear saved timeouts; change parseUint32Field to return (uint32, error) instead
of silently coercing — validate strings.TrimSpace(text) then call
strconv.ParseUint(..., 32) and return an explicit error on failure (including
overflow), and update all callers (notably buildSetConfigRequest) to propagate
or surface that error to the caller (e.g., return/build an error response
instead of using 0); apply the same change to the other similar parser at the
741-750 block so all timeout parsing surfaces errors instead of returning 0.
client/android/preferences.go (1)

349-352: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard negative timeout inputs before casting to uint32.

Negative values currently wrap into very large uint32 values and can silently corrupt timeout overrides. Treat <= 0 as “clear override” (0) before casting.

Proposed fix
 func (p *Preferences) SetRelayTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.RelayTimeoutSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.RelayTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pTimeoutSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.P2pTimeoutSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.P2pTimeoutSeconds = &v
 }
@@
 func (p *Preferences) SetP2pRetryMaxSeconds(secs int64) {
+	if secs <= 0 {
+		v := uint32(0)
+		p.configInput.P2pRetryMaxSeconds = &v
+		return
+	}
 	v := uint32(secs)
 	p.configInput.P2pRetryMaxSeconds = &v
 }

Also applies to: 370-373, 390-393

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/android/preferences.go` around lines 349 - 352, Guard against negative
timeout inputs in SetRelayTimeoutSeconds: treat any secs <= 0 as a clear
override by using 0 instead of allowing a negative int64 to wrap when cast to
uint32. Update SetRelayTimeoutSeconds to check secs <= 0 and set the config
pointer to a uint32(0) in that case; otherwise cast secs to uint32 and assign as
before. Apply the same pattern to the other timeout setter functions in this
file that perform the uint32 cast so they also handle non-positive inputs
safely.
🧹 Nitpick comments (1)
client/internal/lazyconn/manager/manager.go (1)

68-68: ⚡ Quick win

Remove unused inactivityThreshold field from Manager struct.

The field is assigned in NewManager (line 93) but never read; all timeout logic flows through resolvedTimeouts(). It's dead code from the migration and can be safely removed.

🧹 Removal diff
 type Manager struct {
 	engineCtx           context.Context
 	peerStore           *peerstore.Store
-	inactivityThreshold time.Duration
 
 	managedPeers         map[string]*lazyconn.PeerConfig
@@
 	m := &Manager{
 		engineCtx:            engineCtx,
 		peerStore:            peerStore,
-		inactivityThreshold:  inactivity.DefaultInactivityThreshold,
 		managedPeers:         make(map[string]*lazyconn.PeerConfig),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/lazyconn/manager/manager.go` at line 68, Remove the dead
inactivityThreshold field from the Manager struct and eliminate its assignment
in NewManager: delete the inactivityThreshold declaration in the Manager type,
remove any parameter/argument or line in NewManager that sets
m.inactivityThreshold, and ensure no other code reads it (timeout logic should
remain using resolvedTimeouts()). Update related constructors/call sites to stop
passing that value and run tests to confirm nothing else references
inactivityThreshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@client/cmd/service_installer.go`:
- Around line 135-139: The call to applyConnectionModeFlagsToProfile currently
logs a warning on failure, allowing install/reconfigure to continue with
inconsistent connection-mode state; change this so that any error returned by
applyConnectionModeFlagsToProfile is propagated and causes the command to abort
(return the error) instead of printing a warning. Update both places where this
call appears (the install path around applyConnectionModeFlagsToProfile(cmd) and
the reconfigure path) to return the error up the call stack rather than using
cmd.PrintErrf, ensuring the command stops if persistence fails and no further
install/reconfigure steps run.

---

Duplicate comments:
In `@client/android/preferences.go`:
- Around line 349-352: Guard against negative timeout inputs in
SetRelayTimeoutSeconds: treat any secs <= 0 as a clear override by using 0
instead of allowing a negative int64 to wrap when cast to uint32. Update
SetRelayTimeoutSeconds to check secs <= 0 and set the config pointer to a
uint32(0) in that case; otherwise cast secs to uint32 and assign as before.
Apply the same pattern to the other timeout setter functions in this file that
perform the uint32 cast so they also handle non-positive inputs safely.

In `@client/internal/conn_mgr.go`:
- Around line 282-320: The runtime ignores timeout-only pushes if the mode
doesn’t change; detect when relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs
change while a manager is active and apply them by restarting or updating the
manager: in conn_mgr.go where modeChanged/wasManaged/isManaged are computed (and
where lazyConnMgr is checked), add a branch that if isManaged && e.lazyConnMgr
!= nil and any of relayTimeoutSecs/p2pTimeoutSecs/p2pRetryMaxSecs differ from
the manager’s current thresholds, call e.closeManager(ctx) then
e.initLazyManager(ctx), e.startModeSideEffects(), and
e.addPeersToLazyConnManager() (or alternatively propagate the new values into
the running manager); reference modeUsesLazyMgr, lazyConnMgr, relayTimeoutSecs,
p2pTimeoutSecs, p2pRetryMaxSecs, propagateP2pRetryMaxToConns, initLazyManager,
closeManager, startModeSideEffects, and addPeersToLazyConnManager when making
the change.
- Around line 181-188: When you bail out from starting the lazy/dynamic manager
(both the modeUsesLazyMgr() check and the e.rosenpassEnabled check), clear/unset
the lazy P2P mode on e.mode so it no longer advertises "p2p-lazy" or
"p2p-dynamic" to new peers; specifically, before each early return set e.mode to
a non-lazy value (i.e. remove the lazy/dynamic flag or assign the normal/direct
P2P mode) so callers like AddPeerConn() and ActivatePeer() won’t treat peers as
manager-backed (avoiding the conn.Open() fallback and deferred ICE registration
in ModeP2PDynamic).

In `@client/internal/peer/conn.go`:
- Around line 202-212: The explicit "disable backoff" sentinel is being lost
because 0 is being reused for "use default"; introduce and use a distinct
sentinel (e.g. const P2PRetryMaxDisabledSeconds = -1) and update both
propagateP2pRetryMaxToConns() (the code around 1156-1162) and this
initialization block in conn.go to interpret -1 as "disabled" (set backoffCap to
0 or disable the iceBackoff entirely) while treating 0 as "use built-in default"
(map to DefaultP2PRetryMax); also ensure
newIceBackoff()/iceBackoff.SetMaxBackoff accept/handle the disabled sentinel so
the explicit disable round-trips unchanged.

In `@client/internal/peer/guard/guard.go`:
- Around line 144-147: The srReconnectedChan branch currently calls
g.onNetworkChange() synchronously inside the guard select loop, which blocks
other cases; change it to dispatch the work asynchronously: when handling
srReconnectedChan, if g.onNetworkChange != nil spawn a goroutine to invoke it
instead of calling it inline and use a single-flight/coalescing mechanism (e.g.,
a singleflight.Group or an in-struct boolean guarded by a mutex/atomic) so
concurrent srReconnectedChan events only trigger one inflight onNetworkChange
invocation; ensure the async worker performs the existing reset iceBackoff and
recreate workerICE logic and that you respect ctx cancellation to avoid leaking
goroutines.

In `@client/internal/peer/worker_ice.go`:
- Around line 218-230: IsConnected reads lastKnownState under w.muxAgent while
onConnectionStateChange writes lastKnownState without locking, creating a race;
fix by protecting all accesses to lastKnownState with the same mutex
(w.muxAgent) — update onConnectionStateChange (the callback that sets
lastKnownState at/around where it currently assigns at lines ~545-563) to
acquire w.muxAgent.Lock() and defer w.muxAgent.Unlock() around writes (and any
reads), ensuring symmetry with IsConnected and preventing the close-or-skip
race.

In `@client/ui/client_ui.go`:
- Around line 656-666: The parseUint32Field function currently swallows
parse/overflow errors and returns 0 which can accidentally clear saved timeouts;
change parseUint32Field to return (uint32, error) instead of silently coercing —
validate strings.TrimSpace(text) then call strconv.ParseUint(..., 32) and return
an explicit error on failure (including overflow), and update all callers
(notably buildSetConfigRequest) to propagate or surface that error to the caller
(e.g., return/build an error response instead of using 0); apply the same change
to the other similar parser at the 741-750 block so all timeout parsing surfaces
errors instead of returning 0.

---

Nitpick comments:
In `@client/internal/lazyconn/manager/manager.go`:
- Line 68: Remove the dead inactivityThreshold field from the Manager struct and
eliminate its assignment in NewManager: delete the inactivityThreshold
declaration in the Manager type, remove any parameter/argument or line in
NewManager that sets m.inactivityThreshold, and ensure no other code reads it
(timeout logic should remain using resolvedTimeouts()). Update related
constructors/call sites to stop passing that value and run tests to confirm
nothing else references inactivityThreshold.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a37e050-20cd-4834-a220-5d197ba3378d

📥 Commits

Reviewing files that changed from the base of the PR and between 6bf6777 and b2b91ef.

⛔ Files ignored due to path filters (1)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (19)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/engine.go
  • client/internal/lazyconn/manager/manager.go
  • client/internal/peer/conn.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/worker_ice.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
💤 Files with no reviewable changes (2)
  • client/ui/const.go
  • client/ui/event_handler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/internal/peer/handshaker.go

Comment on lines +135 to +139
// Persist any profile-level connection-mode/timeout flags that
// were explicitly set so the daemon picks them up on first start.
if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make connection-mode persistence failures abort the command.

These flags are not passed through buildServiceArguments(), so if applyConnectionModeFlagsToProfile fails here and execution continues, install/reconfigure completes with different connection-mode/timeout behavior than the user explicitly requested. This should return an error instead of downgrading to a warning.

Suggested change
-		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
-			cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
-		}
+		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
+			return fmt.Errorf("persist connection-mode flags: %w", err)
+		}

Apply the same change in both the install and reconfigure paths.

Based on learnings, errors returned by these methods should be propagated early rather than continuing with partially applied state.

Also applies to: 263-265

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/service_installer.go` around lines 135 - 139, The call to
applyConnectionModeFlagsToProfile currently logs a warning on failure, allowing
install/reconfigure to continue with inconsistent connection-mode state; change
this so that any error returned by applyConnectionModeFlagsToProfile is
propagated and causes the command to abort (return the error) instead of
printing a warning. Update both places where this call appears (the install path
around applyConnectionModeFlagsToProfile(cmd) and the reconfigure path) to
return the error up the call stack rather than using cmd.PrintErrf, ensuring the
command stops if persistence fails and no further install/reconfigure steps run.

@MichaelUray MichaelUray force-pushed the pr/b-phase3.5-network-change branch from b2b91ef to 025a750 Compare May 6, 2026 15:01
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (4)
client/cmd/service_installer.go (1)

135-139: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail install/reconfigure when profile persistence fails

Line 137 and Line 263 currently downgrade persistence failure to a warning, which can complete the command with user-requested connection-mode/timeout settings silently not applied.

Suggested fix
-		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
-			cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
-		}
+		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
+			return fmt.Errorf("persist connection-mode flags: %w", err)
+		}
...
-		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
-			cmd.PrintErrf("Warning: failed to persist connection-mode flags: %v\n", err)
-		}
+		if err := applyConnectionModeFlagsToProfile(cmd); err != nil {
+			return fmt.Errorf("persist connection-mode flags: %w", err)
+		}

Also applies to: 263-265

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/cmd/service_installer.go` around lines 135 - 139, The code currently
logs a warning when applyConnectionModeFlagsToProfile(cmd) fails (via
cmd.PrintErrf) which allows install/reconfigure to succeed without persisting
user-requested connection-mode/timeout settings; change this so the error is
returned (or causes the command to fail) instead of being downgraded to a
warning: locate the call to applyConnectionModeFlagsToProfile in
service_installer.go and replace the cmd.PrintErrf warning path with logic that
returns the error (or wraps it with context) from the enclosing function so the
installer/reconfigure aborts on persistence failure; apply the same change for
the other occurrence around lines 263-265 to ensure both install and reconfigure
fail on persistence errors.
client/internal/peer/worker_ice.go (1)

218-231: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Data race on lastKnownState is still present and now matters more.

This is the same concern raised earlier: IsConnected() reads w.lastKnownState under muxAgent, but the onConnectionStateChange callback writes it at lines 550 and 562 without holding the mutex. With the new Conn.onNetworkChange path using IsConnected() to decide whether to skip workerICE.Close(), a torn or stale read (or simply a race-detector failure in CI) can:

  • skip Close on an agent that has just transitioned away from Connected — leaving a dead agent in place across a network change, or
  • close an agent that is still in Connected — the exact case the new "skip when connected" path is meant to avoid.

Promote lastKnownState to atomic.Int32 (or guard the writes with muxAgent); the existing reads in IsConnected/state-change branches map cleanly to atomic load/store.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/worker_ice.go` around lines 218 - 231, The data race is
on WorkerICE.lastKnownState: IsConnected() reads it under muxAgent while the
onConnectionStateChange callback updates it without the mutex; change
lastKnownState to an atomic (e.g., atomic.Int32) and replace reads/writes in
IsConnected and both onConnectionStateChange sites (the assignments at the two
callback locations) with atomic load/store operations, or alternatively move
those writes to be under muxAgent.Lock()/Unlock()—ensure all accesses use the
same atomic or mutex scheme so IsConnected, onConnectionStateChange, and any
other reader/writer (WorkerICE, IsConnected, onConnectionStateChange) are
consistent.
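
A sketch of the atomic variant suggested above; the import path for pion's ice package may differ from the repo's actual version, and the struct is shown with only the relevant field:

// lastKnownState as an atomic value; stores/loads replace the unsynchronized
// assignments in onConnectionStateChange and the locked read in IsConnected.
type WorkerICE struct {
	lastKnownState atomic.Int32 // holds an ice.ConnectionState
}

func (w *WorkerICE) setLastKnownState(s ice.ConnectionState) {
	w.lastKnownState.Store(int32(s))
}

func (w *WorkerICE) IsConnected() bool {
	return ice.ConnectionState(w.lastKnownState.Load()) == ice.ConnectionStateConnected
}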
client/internal/conn_mgr.go (2)

692-700: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

ServerPushedP2pRetryMaxSecs masking 0 with the daemon default still contradicts the "not yet received" semantics used in server.go.

This is the same concern as the earlier review: the comment in server.go GetConfig (lines 1518–1522) says all server-pushed values are zero/empty when no PeerConfig has been received, but this getter substitutes DefaultP2PRetryMax whenever the stored value is 0. As a result, spP2pRetMax is never 0 once the engine is running, so the UI has no reliable "server hasn't pushed a value yet" sentinel for this field — unlike ServerPushedP2pTimeoutSecs and ServerPushedRelayTimeoutSecs, which return raw zeros.

Either move the default-fallback into a separate helper (e.g., ServerPushedP2pRetryMaxSecsOrDefault()) consumed by the UI, or document very clearly that this getter intentionally returns the daemon default and update the server.go comment to match. As written today the two diverge.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 692 - 700,
ServerPushedP2pRetryMaxSecs currently masks a stored zero with
peer.DefaultP2PRetryMax which breaks the "zero means not yet received"
convention used in server.go; change this by making ServerPushedP2pRetryMaxSecs
return the raw stored value (no default masking) and add a new helper named
ServerPushedP2pRetryMaxSecsOrDefault() that returns the stored value or
peer.DefaultP2PRetryMax when zero, then update callers (UI should use the
OrDefault helper) or alternatively update the server.go comment if you
intentionally want the getter to supply the daemon default; locate methods by
name ServerPushedP2pRetryMaxSecs and add the new helper with identical locking
(spMu) semantics.
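
A sketch of the split getter pair the prompt proposes; serverPushedP2pRetryMaxSecs and defaultP2PRetryMaxSecs are stand-in names mirroring the existing serverPushed* fields and peer.DefaultP2PRetryMax, not existing identifiers:

// Raw getter keeps the "0 = not pushed yet" convention from server.go.
func (e *ConnMgr) ServerPushedP2pRetryMaxSecs() uint32 {
	e.spMu.Lock()
	defer e.spMu.Unlock()
	return e.serverPushedP2pRetryMaxSecs
}

// The OrDefault variant is what the UI would consume for display.
func (e *ConnMgr) ServerPushedP2pRetryMaxSecsOrDefault() uint32 {
	if v := e.ServerPushedP2pRetryMaxSecs(); v != 0 {
		return v
	}
	return defaultP2PRetryMaxSecs // hypothetical constant mirroring peer.DefaultP2PRetryMax
}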

277-323: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Same-mode timeout/retry-max changes still don't restart the lazy manager.

This is the same concern raised earlier in the review history: when only the timeouts (or p2pRetryMaxSecs) change inside the same managed mode (e.g., p2p-dynamic 60s → 10s, mode unchanged), the early-return at lines 282–284 is bypassed but then neither the modeChanged && wasManaged && isManaged branch (lines 304–310) nor the lazyConnMgr == nil branch (lines 312–321) fires. The new relayTimeoutSecs / p2pTimeoutSecs are stored on ConnMgr but never applied to the running manager.Manager, since initLazyManager is the only consumer of those fields.

Capture timeoutsChanged before overwriting the old fields and trigger a restart in the same-mode managed case as well:

♻️ Suggested fix
-	prev := e.mode
-	e.mode = newMode
-	e.relayTimeoutSecs = newRelay
-	e.p2pTimeoutSecs = newP2P
-	e.p2pRetryMaxSecs = newP2pRetry
-	e.propagateP2pRetryMaxToConns()
+	prev := e.mode
+	timeoutsChanged := newRelay != e.relayTimeoutSecs || newP2P != e.p2pTimeoutSecs
+	e.mode = newMode
+	e.relayTimeoutSecs = newRelay
+	e.p2pTimeoutSecs = newP2P
+	e.p2pRetryMaxSecs = newP2pRetry
+	e.propagateP2pRetryMaxToConns()

 	wasManaged := modeUsesLazyMgr(prev)
 	isManaged := modeUsesLazyMgr(newMode)
 	modeChanged := prev != newMode

 	if modeChanged && wasManaged && !isManaged {
 		log.Infof("lazy/dynamic connection manager disabled by management push (mode=%s)", newMode)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 		return nil
 	}

-	if modeChanged && wasManaged && isManaged {
-		// Switching between lazy and dynamic at runtime: tear down the
-		// existing manager so initLazyManager picks up the new timeouts.
-		log.Infof("lazy/dynamic mode change %s -> %s, restarting manager", prev, newMode)
+	if wasManaged && isManaged && (modeChanged || timeoutsChanged) {
+		log.Infof("restarting lazy/dynamic manager to apply updated mode/timeouts (%s -> %s, timeouts-changed=%t)",
+			prev, newMode, timeoutsChanged)
 		e.closeManager(ctx)
 		e.statusRecorder.UpdateLazyConnection(false)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 277 - 323, The code updates
connection timeouts but doesn't restart the running lazy manager when only
timeouts/p2pRetryMaxSecs change while remaining in a managed mode; capture
whether timeouts changed before you overwrite fields (compare
newRelay/newP2P/newP2pRetry to
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs into a timeoutsChanged
boolean) and, after computing wasManaged/isManaged, if wasManaged && isManaged
&& timeoutsChanged then log a restart, call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), and then re-init the manager via
e.initLazyManager(ctx) followed by e.startModeSideEffects() and return
e.addPeersToLazyConnManager() so the new timeouts are applied to the running
manager (use existing functions resolveConnectionMode, modeUsesLazyMgr,
initLazyManager, closeManager, propagateP2pRetryMaxToConns,
startModeSideEffects, addPeersToLazyConnManager to locate code).
🧹 Nitpick comments (4)
client/internal/conn_mgr_test.go (1)

206-221: 💤 Low value

Race-safety test only catches the race when run under -race.

The test asserts that nothing crashes, but it asserts no values. Without go test -race, a missing lock would not produce a failure here. Consider adding a comment so future readers know this guard relies on the race detector, and optionally at least one assertion (e.g., the final value is one of the writes) so the test is not a no-op without -race.

♻️ Suggested clarification
+// Run with `-race` to validate concurrent reads/writes on serverPushed* fields.
 func TestConnMgr_ServerPushedFieldsAreRaceSafe(t *testing.T) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr_test.go` around lines 206 - 221, Update
TestConnMgr_ServerPushedFieldsAreRaceSafe to document that it relies on the Go
race detector and to make it meaningful without -race: add a short comment above
the test mentioning that the test is intended to catch races only when run with
`-race` and references the guarded fields (spMu, serverPushedRelayTimeoutSecs)
and accessor ServerPushedRelayTimeoutSecs(); then add a lightweight assertion
after the goroutine completes that the final value returned by
cm.ServerPushedRelayTimeoutSecs() is within the range of expected writes (e.g.,
0..999) or equals the last written value to ensure the test is not a no-op when
run without the race detector.
client/internal/conn_mgr.go (1)

196-214: 💤 Low value

startModeSideEffects can spawn the dynamic-inactivity loop with a nil lazyCtx.

startModeSideEffects is called both from Start (after initLazyManager, which sets e.lazyCtx) and from UpdatedRemotePeerConfig line 319 (also after initLazyManager). So today the contract holds. But the helper itself does not guard against being invoked outside that ordering, and a future caller could easily miss it. Since runDynamicInactivityLoop already null-checks lazyConnMgr and InactivityManager(), consider adding a parallel guard for e.lazyCtx == nil (or, simpler, branching on e.isStartedWithLazyMgr() before spawning) so the goroutine cannot be launched in an undefined state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/conn_mgr.go` around lines 196 - 214, startModeSideEffects can
start the dynamic-inactivity goroutine when e.lazyCtx is nil; fix it by guarding
the dynamic branch so the goroutine is only spawned when the lazy manager is
initialized (e.g. check e.isStartedWithLazyMgr() or e.lazyCtx != nil) before
calling e.wg.Add(1) and launching the goroutine that calls
runDynamicInactivityLoop(e.lazyCtx); keep the existing UpdateLazyConnection call
for ModeP2PLazy but ensure ModeP2PDynamic only proceeds when initLazyManager has
been run (refer to startModeSideEffects, runDynamicInactivityLoop,
initLazyManager, UpdatedRemotePeerConfig, and e.lazyCtx).
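
A sketch of the guard, using the names referenced in the comment; the exact shape of startModeSideEffects may differ, and only the dynamic branch is shown:

case ModeP2PDynamic:
	// Sketch: refuse to spawn the inactivity loop if the lazy manager
	// was never initialized, instead of relying on caller ordering.
	if e.lazyCtx == nil || e.lazyConnMgr == nil {
		log.Warnf("dynamic mode side effects requested before initLazyManager; skipping inactivity loop")
		return
	}
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		e.runDynamicInactivityLoop(e.lazyCtx)
	}()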
client/internal/peer/ice_backoff.go (1)

161-169: 💤 Low value

SetMaxBackoff discards the in-flight backoff progress on every change.

buildBackoff does an unconditional bo.Reset(), so any active doubling progress is lost when the management server pushes a new cap. In the common path (cap goes up or down at peer steady state) this is fine, but if a peer is mid-failure-storm and the cap is changed to a smaller-but-nonzero value, the next markFailure restarts at InitialInterval (~1m) instead of clamping the in-flight delay to the new cap. If that surprises operators, document it — or, better, only rebuild when the new cap is smaller than the current nextRetry delta.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/ice_backoff.go` around lines 161 - 169, SetMaxBackoff
currently throws away in-flight backoff progress by unconditionally calling
buildBackoff(d); change it to only rebuild the backoff when the new cap is
smaller than the current pending retry interval so in-flight doubling is
preserved otherwise. Concretely, inside iceBackoffState.SetMaxBackoff check the
current next retry delta from the existing s.bo (e.g. via the backoff
implementation's NextBackOff/Peek or a saved next value used by markFailure); if
d >= currentNext then just update s.maxBackoff = d and return, otherwise rebuild
s.bo = buildBackoff(d) and set s.maxBackoff = d so an ongoing storm will be
clamped down only when the new cap is smaller than the in-flight delay. Ensure
you use the same symbols: SetMaxBackoff, s.maxBackoff, s.bo, buildBackoff and
keep behavior for equal durations as before.
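
A sketch of the conditional rebuild, assuming the state tracks the delay markFailure last scheduled; the mu and nextDelay fields are hypothetical:

func (s *iceBackoffState) SetMaxBackoff(d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if d == s.maxBackoff {
		return // unchanged cap: keep in-flight progress as before
	}
	s.maxBackoff = d
	if d != 0 && d < s.nextDelay {
		s.bo = buildBackoff(d) // clamp down only when the new cap undercuts the in-flight delay
	}
}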
client/internal/peer/handshaker.go (1)

148-153: 💤 Low value

Minor: RemoteICESupported() is evaluated twice per dispatch; the second call could reorder around readICEListener().

RemoteICESupported() is an atomic load and cheap, but re-evaluating it after readICEListener() opens a tiny window where the remote-ICE state flipped between the two calls (e.g., remote stops supporting ICE just before the debug log is decided). It would not change correctness, but caching the value once per branch makes the intent clearer and the log message more accurate:

♻️ Suggested cleanup
-			if iceListener := h.readICEListener(); iceListener != nil && h.RemoteICESupported() {
-				iceListener(&remoteOfferAnswer)
-			} else if h.RemoteICESupported() {
-				h.log.Debugf("remote OFFER (session %s) without local ICE listener (relay-forced mode or peer in ICE backoff)",
-					remoteOfferAnswer.SessionIDString())
-			}
+			if remoteICE := h.RemoteICESupported(); remoteICE {
+				if iceListener := h.readICEListener(); iceListener != nil {
+					iceListener(&remoteOfferAnswer)
+				} else {
+					h.log.Debugf("remote OFFER (session %s) without local ICE listener (relay-forced mode or peer in ICE backoff)",
+						remoteOfferAnswer.SessionIDString())
+				}
+			}

Also applies to: 173-178

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@client/internal/peer/handshaker.go` around lines 148 - 153, Cache the result
of h.RemoteICESupported() in a local variable and reuse it instead of calling
RemoteICESupported() twice around readICEListener(); e.g., evaluate
remoteICESupported := h.RemoteICESupported() before calling h.readICEListener()
and use remoteICESupported in the if/else condition and in the h.log.Debugf
branch to avoid a race/reordering window; apply the same change to the similar
block around lines referencing readICEListener() and RemoteICESupported() (the
second occurrence mentioned).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@client/cmd/service_installer.go`:
- Around line 135-139: The code currently logs a warning when
applyConnectionModeFlagsToProfile(cmd) fails (via cmd.PrintErrf) which allows
install/reconfigure to succeed without persisting user-requested
connection-mode/timeout settings; change this so the error is returned (or
causes the command to fail) instead of being downgraded to a warning: locate the
call to applyConnectionModeFlagsToProfile in service_installer.go and replace
the cmd.PrintErrf warning path with logic that returns the error (or wraps it
with context) from the enclosing function so the installer/reconfigure aborts on
persistence failure; apply the same change for the other occurrence around lines
263-265 to ensure both install and reconfigure fail on persistence errors.

In `@client/internal/conn_mgr.go`:
- Around line 692-700: ServerPushedP2pRetryMaxSecs currently masks a stored zero
with peer.DefaultP2PRetryMax which breaks the "zero means not yet received"
convention used in server.go; change this by making ServerPushedP2pRetryMaxSecs
return the raw stored value (no default masking) and add a new helper named
ServerPushedP2pRetryMaxSecsOrDefault() that returns the stored value or
peer.DefaultP2PRetryMax when zero, then update callers (UI should use the
OrDefault helper) or alternatively update the server.go comment if you
intentionally want the getter to supply the daemon default; locate methods by
name ServerPushedP2pRetryMaxSecs and add the new helper with identical locking
(spMu) semantics.
- Around line 277-323: The code updates connection timeouts but doesn't restart
the running lazy manager when only timeouts/p2pRetryMaxSecs change while
remaining in a managed mode; capture whether timeouts changed before you
overwrite fields (compare newRelay/newP2P/newP2pRetry to
e.relayTimeoutSecs/e.p2pTimeoutSecs/e.p2pRetryMaxSecs into a timeoutsChanged
boolean) and, after computing wasManaged/isManaged, if wasManaged && isManaged
&& timeoutsChanged then log a restart, call e.closeManager(ctx),
e.statusRecorder.UpdateLazyConnection(false), and then re-init the manager via
e.initLazyManager(ctx) followed by e.startModeSideEffects() and return
e.addPeersToLazyConnManager() so the new timeouts are applied to the running
manager (use existing functions resolveConnectionMode, modeUsesLazyMgr,
initLazyManager, closeManager, propagateP2pRetryMaxToConns,
startModeSideEffects, addPeersToLazyConnManager to locate code).

In `@client/internal/peer/worker_ice.go`:
- Around line 218-231: The data race is on WorkerICE.lastKnownState:
IsConnected() reads it under muxAgent while the onConnectionStateChange callback
updates it without the mutex; change lastKnownState to an atomic (e.g.,
atomic.Int32) and replace reads/writes in IsConnected and both
onConnectionStateChange sites (the assignments at the two callback locations)
with atomic load/store operations, or alternatively move those writes to be
under muxAgent.Lock()/Unlock()—ensure all accesses use the same atomic or mutex
scheme so IsConnected, onConnectionStateChange, and any other reader/writer
(WorkerICE, IsConnected, onConnectionStateChange) are consistent.

---

Nitpick comments:
In `@client/internal/conn_mgr_test.go`:
- Around line 206-221: Update TestConnMgr_ServerPushedFieldsAreRaceSafe to
document that it relies on the Go race detector and to make it meaningful
without -race: add a short comment above the test mentioning that the test is
intended to catch races only when run with `-race` and references the guarded
fields (spMu, serverPushedRelayTimeoutSecs) and accessor
ServerPushedRelayTimeoutSecs(); then add a lightweight assertion after the
goroutine completes that the final value returned by
cm.ServerPushedRelayTimeoutSecs() is within the range of expected writes (e.g.,
0..999) or equals the last written value to ensure the test is not a no-op when
run without the race detector.

In `@client/internal/conn_mgr.go`:
- Around line 196-214: startModeSideEffects can start the dynamic-inactivity
goroutine when e.lazyCtx is nil; fix it by guarding the dynamic branch so the
goroutine is only spawned when the lazy manager is initialized (e.g. check
e.isStartedWithLazyMgr() or e.lazyCtx != nil) before calling e.wg.Add(1) and
launching the goroutine that calls runDynamicInactivityLoop(e.lazyCtx); keep the
existing UpdateLazyConnection call for ModeP2PLazy but ensure ModeP2PDynamic
only proceeds when initLazyManager has been run (refer to startModeSideEffects,
runDynamicInactivityLoop, initLazyManager, UpdatedRemotePeerConfig, and
e.lazyCtx).

In `@client/internal/peer/handshaker.go`:
- Around line 148-153: Cache the result of h.RemoteICESupported() in a local
variable and reuse it instead of calling RemoteICESupported() twice around
readICEListener(); e.g., evaluate remoteICESupported := h.RemoteICESupported()
before calling h.readICEListener() and use remoteICESupported in the if/else
condition and in the h.log.Debugf branch to avoid a race/reordering window;
apply the same change to the similar block around lines referencing
readICEListener() and RemoteICESupported() (the second occurrence mentioned).

In `@client/internal/peer/ice_backoff.go`:
- Around line 161-169: SetMaxBackoff currently throws away in-flight backoff
progress by unconditionally calling buildBackoff(d); change it to only rebuild
the backoff when the new cap is smaller than the current pending retry interval
so in-flight doubling is preserved otherwise. Concretely, inside
iceBackoffState.SetMaxBackoff check the current next retry delta from the
existing s.bo (e.g. via the backoff implementation's NextBackOff/Peek or a saved
next value used by markFailure); if d >= currentNext then just update
s.maxBackoff = d and return, otherwise rebuild s.bo = buildBackoff(d) and set
s.maxBackoff = d so an ongoing storm will be clamped down only when the new cap
is smaller than the in-flight delay. Ensure you use the same symbols:
SetMaxBackoff, s.maxBackoff, s.bo, buildBackoff and keep behavior for equal
durations as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 22fe279d-fe0f-4407-a120-44b05a530f5b

📥 Commits

Reviewing files that changed from the base of the PR and between b2b91ef and 025a750.

⛔ Files ignored due to path filters (1)
  • client/proto/daemon.pb.go is excluded by !**/*.pb.go
📒 Files selected for processing (19)
  • client/android/client.go
  • client/android/preferences.go
  • client/cmd/service.go
  • client/cmd/service_installer.go
  • client/internal/conn_mgr.go
  • client/internal/conn_mgr_test.go
  • client/internal/debug/debug.go
  • client/internal/engine.go
  • client/internal/peer/conn.go
  • client/internal/peer/guard/guard.go
  • client/internal/peer/handshaker.go
  • client/internal/peer/ice_backoff.go
  • client/internal/peer/ice_backoff_test.go
  • client/internal/peer/worker_ice.go
  • client/proto/daemon.proto
  • client/server/server.go
  • client/ui/client_ui.go
  • client/ui/const.go
  • client/ui/event_handler.go
💤 Files with no reviewable changes (2)
  • client/ui/const.go
  • client/ui/event_handler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/proto/daemon.proto

MichaelUray and others added 16 commits May 6, 2026 16:15
…nfig

Additive change for issue netbirdio#5989 Phase 1. New fields use new tag numbers
(11, 12, 13); existing fields (including LazyConnectionEnabled tag 6) are
unchanged so old clients ignore the additions and old servers send
UNSPECIFIED, which the new client maps back via the legacy boolean.

Note: the regenerated pb.go files now report protoc v5.29.3 in their
header (this branch was generated with locally-installed protoc 29.3
instead of upstream's v7.34.1). Functionally identical; header diff is
the only delta beyond the actual schema additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defines Mode enum (relay-forced, p2p, p2p-lazy, p2p-dynamic plus the
client-only sentinels Unspecified and FollowServer), ParseString for
CLI/env input, ToProto/FromProto for wire translation, and the two
backwards-compat helpers ResolveLegacyLazyBool / ToLazyConnectionEnabled
that bridge the old Settings.LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989. Pure addition -- no existing callers touched
in this commit; the engine/conn_mgr migration follows in subsequent
commits in the same PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on warns

NB_CONNECTION_MODE wins over the legacy pair (NB_FORCE_RELAY,
NB_ENABLE_EXPERIMENTAL_LAZY_CONN); when the legacy pair is set
together, NB_FORCE_RELAY wins (most-restrictive, mirrors the
group-conflict rule from issue netbirdio#5990).

Each legacy var emits a one-shot deprecation warning when it actually
contributes to the resolved mode. NB_LAZY_CONN_INACTIVITY_THRESHOLD
becomes an alias for the future relay_timeout setting and warns once.

IsForceRelayed() is kept for callers that have not yet been migrated
(conn.go, statusrecorder); they will be updated in the engine/conn
refactor commits later in this PR.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new CLI flags map onto the new connection-mode plumbing:
- --connection-mode <relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server>
- --relay-timeout <seconds>
- --p2p-timeout <seconds>

Plumbed through three sites in cmd/up.go (SetConfigRequest, ConfigInput,
LoginRequest), persisted in profilemanager.Config, and added as new
fields on the daemon.proto IPC messages. Empty / not-changed flags fall
back to the server-pushed value (which itself falls back to the legacy
lazy_connection_enabled boolean for old servers).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EngineConfig gains ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds.
ConnMgr now stores the resolved Mode plus the raw inputs (env, config)
so it can re-resolve when the server pushes a new PeerConfig.

UpdatedRemoteFeatureFlag is renamed to UpdatedRemotePeerConfig and
takes the full PeerConfig pointer; a thin shim with the old name
delegates to it for callers that haven't been updated yet.

connect.go copies the three new fields from profilemanager.Config into
the EngineConfig builder, with a tolerant parser that logs and falls
through to Unspecified on invalid input.

Phase 1 of issue netbirdio#5989. peer/conn.go forwarding follows in C4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ConnConfig gains a Mode field forwarded from the engine. Open() now
checks Mode == ModeRelayForced instead of calling the global env-reader
IsForceRelayed(). The local 'forceRelay' variable name is renamed to
'skipICE' to make the new branching intent explicit.

The PeerStateUpdate block at the end of Open() also reads from
conn.config.Mode now, so the StatusRecorder sees the per-peer mode
rather than the global env var.

A single remaining caller of IsForceRelayed() (srWatcher.Start in
engine.go) is left for a follow-up; that path uses a process-wide flag
not per-peer state, so it can be migrated in Phase 2 once srWatcher
itself learns about ConnectionMode.

Phase 1 of issue netbirdio#5989. Engine forwarding (C5) follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
createPeerConn now reads ConnMgr.Mode() and copies it into
peer.ConnConfig, so the per-peer Open() loop in conn.go can take the
ModeRelayForced skip-ICE branch without reading the global env var.

This is the last wiring commit for the client side of Phase 1; the
server-side mgmt changes (Settings + OpenAPI + handler + audit +
NetworkMap-build) follow in Section D.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three fields are nullable to distinguish 'use built-in default'
(NULL) from explicit values (incl. 0 = never tear down). Copy() now
deep-clones the new pointer fields via two small helpers.

GORM AutoMigrate creates the new columns at first start; existing
accounts have NULL in all three columns and resolve via the legacy
LazyConnectionEnabled boolean.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tings

Three new optional, nullable fields with descriptions of the NULL =
built-in-default semantics and the Phase-1-vs-Phase-2 status of
p2p-dynamic. Regenerated types.gen.go via the existing oapi-codegen
tooling.

The generated AccountSettingsConnectionMode enum has the canonical
values relay-forced / p2p / p2p-lazy / p2p-dynamic, plus a Valid()
helper for handler-side validation.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PUT /api/accounts/{id} now accepts connection_mode (validated against
the four-value enum via the generated AccountSettingsConnectionMode.
Valid()), p2p_timeout_seconds and relay_timeout_seconds. NULL in the
JSON body keeps the existing value untouched (= "no client-side
override on this round-trip"); explicit NULL-clear via API uses a
distinct PATCH-style call which is out-of-scope for Phase 1.

Response payload mirrors the input fields back as nullable so the
dashboard can distinguish "use default" from "explicit value".

Phase 1 of issue netbirdio#5989. Audit-event emission follows in D5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AccountConnectionModeChanged (121), AccountRelayTimeoutChanged (122),
AccountP2pTimeoutChanged (123) -- emitted from account.go when settings
change. Per-peer / per-group event codes are reserved for Phase 3
(issue netbirdio#5990).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
handleConnectionModeSettings is invoked from the same diff-detection
block as handleLazyConnectionSettings; emits one StoreEvent per
changed field (ConnectionMode, RelayTimeoutSeconds, P2pTimeoutSeconds)
with old/new values in the meta payload.

Four small ptr-equality / deref helpers are added for nullable string
and uint32 fields. They are package-private and named after the
existing convention used elsewhere in the package.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elds

Two changes in one commit because they're inseparable:

1. Move client/internal/peer/connectionmode/ to shared/connectionmode/.
   The package now needs to be importable from BOTH client/ and
   management/ (which is impossible while it lives under client/internal/
   per Go's internal-package rule). All imports updated; tests pass on
   both sides.

2. Extend management/internals/shared/grpc/conversion.go::toPeerConfig
   to populate the three new PeerConfig fields (ConnectionMode,
   P2PTimeoutSeconds, RelayTimeoutSeconds) using the connectionmode
   helpers. The legacy LazyConnectionEnabled boolean is now derived
   from the resolved Mode via ToLazyConnectionEnabled() rather than
   copied verbatim from Settings -- this is the central backwards-compat
   contract: old clients see only the boolean, new clients prefer the
   explicit enum and ignore the bool.

Resolution rules (Phase 1, account-wide only):
- Settings.ConnectionMode != nil and parses -> wins
- Otherwise -> ResolveLegacyLazyBool(LazyConnectionEnabled)
- timeouts: Settings.RelayTimeoutSeconds / P2pTimeoutSeconds when non-NULL,
  else 0 (= server has no preference; client uses built-in default)

Per-peer / per-group resolution comes in Phase 3 (netbirdio#5990).

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Nine sub-cases cover the Phase-1 resolution matrix from spec section 3:
- no settings -> default (P2P + lazy=false)
- legacy bool only -> mapped via ResolveLegacyLazyBool
- explicit ConnectionMode -> wins over the legacy bool
- timeouts propagate
- garbage ConnectionMode value -> tolerant fallback to legacy bool

Particular attention to the structural compat gap: relay-forced cannot
be expressed via the legacy boolean, so the wire field for old clients
is sent as false. Documented in the spec, asserted here.

Existing TestAccount_GetPeerNetworkMap remains green: existing test
peers have ConnectionMode=NULL in Settings, falls through to the legacy
ResolveLegacyLazyBool(false) -> ModeP2P -> wire bool false.

Phase 1 of issue netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of netbirdio#5989 needs to dynamically detach the ICE listener at runtime
(when p2p-dynamic mode hits its ICE-inactivity threshold). Adds the
RemoveICEListener method and extends the Add/RemoveICEListener pair to
hold h.mu; Listen() now reads h.iceListener through readICEListener
under the same mutex, fixing a latent race that didn't matter while
AddICEListener was the only writer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
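
Roughly what the mutex-guarded accessor trio described above looks like; signatures are approximated, with the listener's argument type following the OFFER-dispatch code elsewhere in this PR:

// Sketch of the Add/Remove/read trio; all three hold h.mu so Listen()'s
// dispatch sees a consistent value.
func (h *Handshaker) AddICEListener(l func(*OfferAnswer)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.iceListener = l
}

func (h *Handshaker) RemoveICEListener() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.iceListener = nil
}

func (h *Handshaker) readICEListener() func(*OfferAnswer) {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.iceListener
}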
Phase 2 of netbirdio#5989 needs the daemon to fire ICE-teardown and relay-
teardown at different idle thresholds. Adds NewManagerWithTwoTimers
plus iceInactiveChan / relayInactiveChan; the periodic check loop
fires each channel when its threshold elapses. Threshold == 0 disables
that channel (peer.lazy: iceTimeout=0; peer.dynamic: both>0; eager
modes don't register peers at all).

NewManager (Phase-1 entry point) becomes a thin wrapper that
delegates to newManager with iceTimeout=0, preserving p2p-lazy
semantics exactly. The Phase-1 InactivePeersChan now aliases the
new RelayInactiveChan so existing callers (engine.go p2p-lazy path)
continue to work unchanged.

Five sub-tests cover the timeout matrix:
- TwoTimers_OnlyICEFires (peer idle between thresholds)
- TwoTimers_BothFire (peer idle past both)
- TwoTimers_ICEDisabled (iceTimeout=0)
- TwoTimers_RelayDisabled (relayTimeout=0)
- TwoTimers_BothDisabled (both 0 = manager inert)

Plus Phase1_LazyEquivalence proves NewManager-call-site behaviour is
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
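
The per-peer check the commit describes, reduced to a sketch; field names are approximate and the real loop walks all managed peers on a ticker:

// Sketch: fire each teardown channel independently; a zero threshold
// disables that timer, which is how p2p-lazy keeps relay-only semantics
// (iceTimeout=0).
func (m *Manager) checkPeerIdle(peerKey string, idle time.Duration) {
	if m.iceTimeout > 0 && idle >= m.iceTimeout {
		m.iceInactiveChan <- peerKey // detach ICE, keep relay
	}
	if m.relayTimeout > 0 && idle >= m.relayTimeout {
		m.relayInactiveChan <- peerKey // tear down the whole lazy connection
	}
}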
MichaelUray and others added 28 commits May 6, 2026 16:15
Phase 3 of netbirdio#5989. Each new Conn now receives the resolved cap via
ConnConfig. Conn.Open() then initializes its iceBackoffState. The
sentinel-translation (uint32-max -> 0) happens inside Conn so the
engine-level code stays simple.

Refs netbirdio#5989 (Phase 3).
Phase 3 (E1) of netbirdio#5989. Full pipeline implemented:
- daemon.proto: add iceBackoffFailures/NextRetry/Suspended fields to PeerState (field IDs 20-22)
- client/internal/peer/status.go: add IceBackoff* fields to State struct + UpdatePeerIceBackoff() method
- client/internal/peer/conn.go: push snapshot to statusRecorder after onICEFailed / onICEConnected
- client/status/status.go: wire fields through ToProtoFullStatus(), PeerStateDetailOutput, parsePeers()
- parsePeers() appends "ICE backoff: suspended for ..." line only when suspended
- client/status/status_test.go: update JSON/YAML expectations for new fields

When not suspended, the line is omitted (no noise on healthy peers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 of netbirdio#5989 follow-up. The PeerState.IceBackoffSuspended flag is
only refreshed on ICE state-change events (markFailure / markSuccess),
so it stays true even after the suspension window has elapsed. The
status detail-output now also wall-clock-checks IceBackoffNextRetry
before printing the line, avoiding lines like "suspended for -1m45s"
between the expiry and the next event-driven snapshot push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RetryMaxSeconds proto changes

Picked-up commits in this branch added new fields to daemon.proto and
management.proto (ConnectionMode enum, P2pTimeoutSeconds,
RelayTimeoutSeconds, P2pRetryMaxSeconds, ICE-backoff fields). The
generated .pb.go files in the original Phase-3.7i branch were left
out-of-date during the cherry-pick; regenerated here so the package
builds.
…uest test

setconfig_test.go reflection-tests (TestSetConfig_AllFieldsSaved +
TestCLIFlags_MappedToSetConfig) tolerate the four new fields:
ConnectionMode, P2PTimeoutSeconds, RelayTimeoutSeconds,
P2PRetryMaxSeconds. They are in the proto so daemons can advertise
them via GetConfig, but SetConfig RPC does not apply them at runtime
today — the CLI sets them via 'netbird service install/reconfigure'
writing the active profile file directly. Listed in the test maps so
the structural drift-detector passes; wiring through SetConfig is a
follow-up.
The pb.go regeneration on dev machines uses local protoc (v5.29.3 here)
which causes the Proto Version Check workflow to flag the diff as a
toolchain mismatch. Restore the upstream-main version strings (v6.33.1
for client/proto, v7.34.1 for shared/management/proto) — generated
content matches; only the comment-line differs.
…rsion pin

- conn_mgr.go: tagged switch on uint32 sentinel (QF1002 staticcheck)
- engine.go: //nolint:staticcheck on the deprecated peer.IsForceRelayed
  call site - the function is intentionally retained for Phase-1
  backwards compat with daemons that haven't migrated to ResolveModeFromEnv yet
- lazyconn/manager.go: //nolint:staticcheck on the deprecated
  inactivity.NewManager fallback - kept on purpose for callers without
  resolved two-timer config (iceTimeout=0 && relayTimeout=0)
- proxy_service.pb.go: pin protoc version header to upstream-main v7.34.1
  (Proto Version Check workflow flagged it; only the comment-line differs)
…ebug bundle

Upstream's debug-bundle test TestAddConfig_AllFieldsCovered (added in
netbirdio#6071) reflection-checks every Config field is either
rendered into the bundle or in the excluded map. The four Phase 1+2+3
fields introduced in this PR — ConnectionMode, RelayTimeoutSeconds,
P2pTimeoutSeconds, P2pRetryMaxSeconds — must therefore be rendered.
Listed at the end of addCommonConfigFields right after the existing
LazyConnectionEnabled line, in the same key:value format the test
matches against.
Phase 3.5 hotfix for netbirdio#5989. When Guard detects that the signal/relay
layer has reconnected (typically after a network/interface change
like LTE-modem replug or WiFi-roaming), the per-Conn iceBackoff is
now reset to a fresh state AND the workerICE is recreated.

Without this, after a long Phase-2-style suspension the daemon would
remain stuck on Relay even when the new network conditions allow P2P:
the AttachICE-gate is open (suspend window expired) but the previously
DetachICE-closed workerICE never gets a fresh pion-agent, producing
"ICE Agent is not initialized yet" warnings instead of new ICE
attempts.

Verified on badmitterndorf-r1 (LTE-replug -> public IP): without
this fix all peers stayed Relayed; after daemon-restart 3/5 peers
came up as P2P. With this fix the same transition is automatic.

Spec section 5.1 listed Interface-Change as a Reset-Trigger but the
implementation was missed in the original Phase-3 plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
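
A minimal sketch of the reset path this hotfix describes, with pared-down stand-ins for the per-Conn members in client/peer/conn (the recreate step is refined into an in-place path by later commits in this stack):

```go
package peer

// iceBackoffState and workerICE are reduced to what the sketch needs.
type iceBackoffState struct{ failures int }

func (b *iceBackoffState) Reset() { b.failures = 0 }

type workerICE struct{ /* pion agent, listener, ... */ }

type conn struct {
	iceBackoff *iceBackoffState
	workerICE  *workerICE
	newWorker  func() *workerICE
}

// onNetworkChange runs once per peer after Guard reports that the
// signal/relay layer reconnected. It clears the accumulated failure
// history and gives the peer a fresh workerICE, so ICE attempts restart
// on the new interface instead of producing "ICE Agent is not
// initialized yet" warnings against a worker detached earlier.
func (c *conn) onNetworkChange() {
	c.iceBackoff.Reset()
	c.workerICE = c.newWorker()
}
```
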
…ow-up)

The Phase-3.5 hotfix recreated workerICE on srReconnect but only
swapped the handshaker listener; no offer was sent. Result: pion's
agent stays nil until a peer-initiated offer arrives. Hardware-verified
on badmitterndorf: 144 "ICE Agent is not initialized yet" warnings
after the bounce, with no peer transitioning back to P2P.

Now we explicitly call handshaker.SendOffer() after the listener
swap so the remote responds, the local OnNewOffer fires, and
reCreateAgent populates the new agent.

Refs netbirdio#5989 (Phase 3.5).
…gent

Phase 3.6 hotfix for netbirdio#5989. OnRemoteCandidate previously dropped any
candidate that arrived while w.agent was nil (= during the small race
window between OnNewOffer kicking off reCreateAgent and the agent
being assigned). After Phase-3.5 introduced in-process workerICE
recreate on network change, this race becomes much more frequent --
the remote peer often replies with candidates before our local
OnNewOffer handler has finished assigning the new agent.

Buffer up to 64 candidates per peer, drained inside OnNewOffer right
after w.agent is set. Buffer overflow logs a warning but is bounded
to prevent OOM under a misbehaving peer.

Verified: pre-fix on badmitterndorf showed ~150 "ICE Agent is not
initialized yet" warnings per peer after a network bounce, with no
P2P recovery within several minutes. After this fix the candidates
survive the race and pion can complete the pair-checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
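
A sketch of the bounded candidate buffer this commit describes (a later commit in this stack concludes the buffer is not strictly needed; the sketch only illustrates the approach). Candidates are reduced to strings and the field names are illustrative:

```go
package peer

import "sync"

const maxBufferedCandidates = 64

type workerICE struct {
	mu       sync.Mutex
	agentSet bool
	buffered []string
}

// OnRemoteCandidate buffers candidates that arrive while the agent is
// still nil (the race window between OnNewOffer kicking off reCreateAgent
// and the new agent being assigned) instead of dropping them.
func (w *workerICE) OnRemoteCandidate(cand string, addToAgent func(string)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if !w.agentSet {
		if len(w.buffered) >= maxBufferedCandidates {
			// bounded: a misbehaving peer cannot grow the buffer without limit
			return
		}
		w.buffered = append(w.buffered, cand)
		return
	}
	addToAgent(cand)
}

// drainBuffered runs inside OnNewOffer right after the new agent is set.
func (w *workerICE) drainBuffered(addToAgent func(string)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.agentSet = true
	for _, cand := range w.buffered {
		addToAgent(cand)
	}
	w.buffered = nil
}
```
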
…recreate path

Phase 3.5 originally closed workerICE and replaced the wrapper via
NewWorkerICE() + listener-swap. Empirically this caused ICE to fail
with a ~13s pair-check timeout after a network event (LTE-glitch on
badmitterndorf), while a fresh daemon-restart converged in <1s on
the exact same network conditions. Most likely cause: state-leak
between the old and new workerICE wrappers (sockets, stdnet bindings)
that prevented the new pion agent from gathering working candidates.

Refactored to reuse the existing workerICE wrapper. The well-tested
"existing agent + new offer with different sessionID -> tear down +
reCreateAgent" branch already in worker_ice.go (Phase 1 code) is the
blessed path for in-place recreate. We just:
  1) Reset iceBackoff (counter -> 0)
  2) Close the current pion agent (w.agent = nil)
  3) handshaker.SendOffer() so the remote responds with a fresh
     sessionID that flows back through the still-attached listener

The buffer/race fix that netbirdio#5805 originally proposed and pappz argued
against is NOT needed here -- the warnings happen in both fresh-start
(P2P succeeds) and recreate (P2P fails), proving the warnings are
benign noise that pion handles via retransmits. The actual bug was
in the wrapper-replacement logic of the original Phase-3.5 commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
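
A sketch of the three-step in-place recreate as of this commit (the next commit drops step 3 from onNetworkChange again); the types are stand-ins, not the actual conn.go / worker_ice.go code:

```go
package peer

type iceBackoffState struct{ failures int }

func (b *iceBackoffState) Reset() { b.failures = 0 }

type workerICE struct{ agent any }

// closeAgent drops the agent bound to the stale socket; the next offer
// then takes the existing "agent == nil -> reCreateAgent" branch.
func (w *workerICE) closeAgent() { w.agent = nil }

type handshaker struct{}

func (h *handshaker) SendOffer() error { return nil }

// recreateInPlace reuses the existing workerICE wrapper instead of
// replacing it, avoiding the socket/stdnet state-leak described above.
func recreateInPlace(b *iceBackoffState, w *workerICE, h *handshaker) error {
	b.Reset()            // 1) counter back to 0
	w.closeAgent()       // 2) close the current pion agent
	return h.SendOffer() // 3) remote answers with a fresh sessionID
}
```
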
The previous Phase-3.7 hotfix called handshaker.SendOffer() from
onNetworkChange. But Guard already drives sendOffer via its
newReconnectTicker (800ms initial, ~4 retries in the first 6s)
right after the same srReconnect event that fires our callback.

Both paths firing produces ~5 sendOffer calls per peer in ~6s.
Each offer routes to the remote's workerICE.OnNewOffer where (if
the local sessionID has changed since the last reCreateAgent) it
triggers a fresh tear-down + reCreateAgent. With 5 such triggers
in rapid succession, the remote ICE-pair-checks never complete --
they get torn down before pion's GatherCandidates / Dial can finish.

Empirically observed on badmitterndorf during real LTE-carrier
glitches: peers got 5 sending-offer log lines in 6s, then no
"set ICE to active" event for several minutes -> stayed Relayed.

Fix: do nothing about offers in onNetworkChange. Just close the
agent (w.agent = nil). The Guard's natural reconnect-burst drives
the next sendOffer, the remote replies, and our existing
"agent==nil + new offer -> reCreateAgent" path handles the
recreation cleanly.

Refs netbirdio#5989 (Phase 3.7 follow-up).
… (Guard-Loop Fix from netbirdio#5805)

Phase 3.7c (netbirdio#5989) re-introduces the Guard-Loop Fix from MichaelUray's
PR netbirdio#5805 that pappz did not accept (he believed sessionID-skip was
sufficient). Empirically on badmitterndorf during LTE-carrier
instability we now have hard evidence the bug is real cross-NAT too:

uray-mic-dh received 5 different sessionIDs from the remote peer in
2 minutes (06:27-06:29). Each different sessionID triggered the
in-place "tear-down + reCreateAgent" branch in OnNewOffer. With
both sides' Guards firing fresh offers in lockstep, the in-flight
ICE pair-checks (5-10s) never completed -- each new offer tore
down the previous attempt before pion could finish.

Fix: when agentConnecting==true, ignore any new offer regardless of
sessionID. Let the in-flight attempt either succeed (-> agent stays)
or fail (-> agentConnecting goes false naturally via closeAgent),
THEN allow a fresh sessionID to trigger reCreateAgent.

Refs netbirdio#5805 (Guard-Loop Fix), Phase 3.7c of netbirdio#5989.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
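
A sketch of the Guard-Loop fix: while an attempt is in flight, OnNewOffer ignores every new offer regardless of sessionID. Field names and the sessionID type are illustrative:

```go
package peer

import "sync/atomic"

type workerICE struct {
	agentConnecting atomic.Bool // set when an ICE attempt starts, cleared by closeAgent
	lastSessionID   uint64
}

// OnNewOffer lets an in-flight attempt finish before any new offer can
// tear it down; only then may a changed sessionID trigger reCreateAgent.
func (w *workerICE) OnNewOffer(sessionID uint64, reCreateAgent func()) {
	if w.agentConnecting.Load() {
		return // pair-checks in progress: ignore the offer entirely
	}
	if sessionID == w.lastSessionID {
		return // duplicate offer for the current attempt
	}
	w.lastSessionID = sessionID
	w.agentConnecting.Store(true)
	reCreateAgent() // tear down the old agent and start fresh pair-checks
}
```
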
In p2p-dynamic mode the runDynamicInactivityLoop calls DetachICEForPeer
when the per-peer iceTimeout fires, but does NOT call
lazyConnMgr.DeactivatePeer. The lazy state therefore stays at
watcherInactivity while the handshaker's iceListener is gone -- a
sub-state the lazy manager does not represent.

Before this change, ConnMgr.ActivatePeer gated AttachICE behind
lazyConnMgr.ActivatePeer's "found" flag, which is a one-shot edge
(watcherActivity -> watcherInactivity). After an iceTimeout-detach,
subsequent signal messages would call ActivatePeer with found=false,
AttachICE would never run, and remote OFFERs would silently drop at
handshaker.Listen():143 (iceListener==nil). The peer was stuck on
relay forever even while both sides kept signaling normally.

Symptom observed on bm router (uray-mic-dh peer): no "set ICE to
active" event for 22+ minutes, six different remote session IDs
received, zero ICE attempts started. Recovery was only possible via
daemon restart.

Fix:
1. ConnMgr.ActivatePeer: call AttachICE unconditionally for
   p2p-dynamic on every signal trigger. AttachICE is idempotent
   (returns nil if listener already attached, conn.go:1048) and
   honors iceBackoff.IsSuspended() so the failure-backoff is not
   bypassed.

2. handshaker.Listen: emit a Debug-level note when an OFFER/ANSWER
   arrives without an attached ICE listener (typical during ICE
   backoff suspension or relay-forced mode), so the dispatch state
   is observable in debug logs without alarming on healthy systems.

Verified live on bm router post-deploy:
- uray-mic-dh recovered to P2P (srflx/srflx) within seconds of
  restart, confirming the AttachICE flow runs end-to-end.
- ctb50-d (V7iprtVU... peer) cycles correctly: ICE failure ->
  backoff suspend -> backoff expire -> next signal triggers
  AttachICE -> ICE retry -> backoff doubles. Exact Phase-3
  behavior, now also working post-iceTimeout-detach.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
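
A sketch of the Phase-3.7d shape of ActivatePeer and the idempotent AttachICE it relies on; the types and mode string are simplified stand-ins for the ConnMgr / Conn code:

```go
package peer

type conn struct {
	listenerAttached bool
	backoffSuspended func() bool
}

// AttachICE is idempotent: it is a no-op when the listener is already
// attached, and it honours the failure backoff so retries are not bypassed.
func (c *conn) AttachICE() error {
	if c.listenerAttached {
		return nil
	}
	if c.backoffSuspended() {
		return nil
	}
	c.listenerAttached = true
	// ... arm the ICE listener and kick off an offer ...
	return nil
}

// ActivatePeer runs on every signal trigger. In p2p-dynamic mode it no
// longer gates AttachICE behind the lazy manager's one-shot "found" edge,
// so a peer detached by iceTimeout re-arms on the next signal message.
func ActivatePeer(mode string, lazyFound bool, c *conn) {
	if mode == "p2p-dynamic" {
		_ = c.AttachICE() // safe to call unconditionally
		return
	}
	if lazyFound {
		_ = c.AttachICE()
	}
}
```
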
…e 3.7e)

Phase 3.7d ensured AttachICE runs on every signal trigger via
ConnMgr.ActivatePeer, but that fix only takes effect when the next
signal arrives. After an LTE-modem replug the iceListener can end up
detached for some peers (paths via onICEFailed -> DetachICE during the
bounce window, plus concurrent state-change callbacks while the agent
is being torn down). Until the next signal triggers AttachICE, every
remote OFFER reaches handshaker.Listen() with iceListener==nil and is
silently dropped. In one observed case ygW6ySPb stayed on relay for
5 minutes after the bounce despite uray-mic-dh sending nine OFFERs
with fresh sessionIDs; recovery only happened after a manual daemon
restart.

Phase 3.5's onNetworkChange already runs once per peer right after
srReconnect: it resets iceBackoff and closes the workerICE agent. This
is the natural place to also force the iceListener back on, so the
Guard's reconnect-loop can drive a fresh sendOffer that actually
reaches OnNewOffer instead of being dropped.

Refactor:
- Extract attachICEListenerLocked from AttachICE. The locked helper
  returns true when a new attachment was made and false on no-op
  (already attached, ICE backoff suspended, handshaker not initialised,
  or workerICE not present).
- AttachICE checks iceBackoff.IsSuspended first to preserve the
  existing relay-forced-mode error semantics in
  TestConn_AttachICE_NoOpWhenSuspended, then delegates to
  attachICEListenerLocked, then sends an offer if a new attachment was
  made.
- onNetworkChange calls attachICEListenerLocked after closing the
  workerICE agent. Deliberately does NOT call SendOffer because the
  Guard reconnect-ticker already issues one right after the same
  srReconnect event; sending another here would re-introduce the
  offer-storm Phase 3.7b removed.

Verified live on bm router with a 90 s management+signal blackhole:
- All 6 peers logged "ICE state reset on network change (agent closed;
  listener re-armed; Guard will resend offer)" within 1 ms of
  srReconnect.
- Peers whose listener had been detached (X+HhIybX, Rkqv) logged
  "ICE listener attached (locked path)" from the new helper.
- ygW6ySPb (uray-mic-dh), yhM26jA (dk20), dxvaVD2 (w11-test1) all
  reached "set ICE to active connection" within 75-85 s of the
  blackhole ending. Previously ygW6ySPb required a daemon restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
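
A sketch of the Phase-3.7e refactor: the locked helper reports whether a new attachment was made, AttachICE sends an offer only in that case, and onNetworkChange re-arms the listener without sending one. The Conn type is heavily simplified:

```go
package peer

type conn struct {
	listenerAttached bool
	backoffSuspended bool
	handshakerReady  bool
	workerPresent    bool
	agentOpen        bool
}

// attachICEListenerLocked re-arms the ICE listener; the caller holds the
// conn mutex. It returns true only when a new attachment was made.
func (c *conn) attachICEListenerLocked() bool {
	if c.listenerAttached || c.backoffSuspended || !c.handshakerReady || !c.workerPresent {
		return false
	}
	c.listenerAttached = true
	return true
}

// AttachICE keeps the existing suspended/no-op semantics and only sends
// an offer when the locked helper actually attached something.
func (c *conn) AttachICE(sendOffer func()) {
	if c.backoffSuspended {
		return
	}
	if c.attachICEListenerLocked() {
		sendOffer()
	}
}

// onNetworkChange closes the agent and re-arms the listener but sends no
// offer: Guard's reconnect ticker issues one right after the same
// srReconnect event, and a second offer here would bring back the
// offer storm removed in Phase 3.7b.
func (c *conn) onNetworkChange() {
	c.agentOpen = false
	c.attachICEListenerLocked()
}
```
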
…hange (Phase 3.7f)

After Reset() is called for an srReconnect / network-change event,
the first ICE pair-check often fails because pion is still working
with NAT mappings that were torn down by the network event. Falling
back to the normal 1-minute initial exponential interval after that
single failure leaves the peer on relay for 60-100 seconds, far
longer than the underlying connectivity actually warrants - the
second attempt typically succeeds within a few seconds because the
new LTE/Wi-Fi mapping is now warm.

Behaviour change:
- Track lastResetAt in iceBackoffState; Reset() stamps it.
- Inside markFailure, while time.Since(lastResetAt) <
  networkChangeGracePeriod (30 s), use a fixed networkChangeRetryDelay
  of 5 s and do NOT advance the long-term exponential schedule.
- Outside the grace window, behaviour is unchanged: normal exponential
  backoff capped at maxBackoff.

Verified live on bm router with a 60 s management+signal blackhole
(equivalent to an LTE replug):

  Pre-3.7f real LTE bounce (08:03): uray-mic-dh stayed on relay for
  103 s before P2P recovered. ICE failure #1 -> 60 s+ exponential
  suspend -> ICE retry -> success.

  Post-3.7f 60 s blackhole (08:18): srReconnect at 08:18:46.672,
  ICE failure #1 at 08:18:59 -> "suspending for 5s, next retry at
  08:19:04", then "ICE success, resetting backoff (was 1 failures)"
  at 08:19:09. Total relay-only window: 23 s (4.5x faster).

  All four P2P-capable peers (uray-mic-d4, uray-mic-dh, dk20,
  w11-test1) reconverged within the same 23 s window.

Tests:
- TestIceBackoff_GracePeriodAfterReset_ShortDelay: first and second
  failure within the grace window both return networkChangeRetryDelay
  and do not advance the underlying ExponentialBackOff.
- TestIceBackoff_GraceExpired_NormalExponential: forcing lastResetAt
  into the past restores the ~1 m initial exponential delay.
- TestIceBackoff_NoGraceWithoutReset: a fresh state without an
  explicit Reset uses the normal exponential schedule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
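
A sketch of the grace-window logic, using github.com/cenkalti/backoff/v4 as a stand-in for whatever ExponentialBackOff implementation the branch actually uses; the constants match this commit (Phase 3.7h later widens them to 60 s / 2 s):

```go
package peer

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

const (
	networkChangeGracePeriod = 30 * time.Second
	networkChangeRetryDelay  = 5 * time.Second
)

type iceBackoffState struct {
	exp         *backoff.ExponentialBackOff
	lastResetAt time.Time
	failures    int
}

func newIceBackoffState() *iceBackoffState {
	return &iceBackoffState{exp: backoff.NewExponentialBackOff()}
}

// Reset is called on srReconnect / network change and stamps the grace window.
func (b *iceBackoffState) Reset() {
	b.exp.Reset()
	b.failures = 0
	b.lastResetAt = time.Now()
}

// markFailure returns how long to suspend ICE before the next attempt.
// Inside the post-reset grace window it returns a short fixed delay and
// does not advance the long-term exponential schedule.
func (b *iceBackoffState) markFailure() time.Duration {
	b.failures++
	if !b.lastResetAt.IsZero() && time.Since(b.lastResetAt) < networkChangeGracePeriod {
		return networkChangeRetryDelay
	}
	return b.exp.NextBackOff()
}
```
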
…nnected (Phase 3.7g)

Phase 3.5 added workerICE.Close to onNetworkChange so that the ICE
agent gets recreated after an LTE-modem replug or other network event.
That's correct when the underlying peer-to-peer UDP path actually
broke -- pion's ICE state machine has already gone Disconnected/Failed
and w.agent has been cleared by closeAgent in onConnectionStateChange.

But many srReconnect events are NOT real connectivity losses. A brief
signal/relay outage (gRPC keepalive timeout, transient IP-blacklist,
DNS hiccup) fires srReconnect even though peer-to-peer WG keepalives
between clients kept flowing the whole time. Closing the still-working
ICE agent in that case forces:

  - workerICE.Close clears w.agent and triggers ICE.Failed callbacks
  - onICEFailed marks a backoff failure and calls DetachICE
  - WireGuard endpoint is removed (~1 s ping dropout)
  - Next signal triggers OnNewOffer -> reCreateAgent (~12 s pair-check)
  - Total observable interruption: 15-25 s for an event that should
    have been a no-op

Verified live on the second badmitterndorf router (172A2, wired to the
home LAN, identical netbird config except that it runs on stable
connectivity):

  Pre-3.7g (Phase 3.7f): a 60 s mgmt blackhole caused all four
  P2P-capable peers to log "ICE disconnected, do not switch to Relay
  ... configure WireGuard endpoint to ..." and a full 21 s
  ICE-renegotiation cycle even though each peer's WG endpoint ended
  up at the SAME address it had before.

  Post-3.7g: same 60 s mgmt blackhole, debug log shows
  "network change: skipping workerICE.Close (ICE still Connected,
  soft-fallback)" for every healthy peer. Zero state changes in
  netbird status, zero ICE failures, zero ping dropout.

Implementation:
- WorkerICE.IsConnected returns true when w.agent != nil and
  lastKnownState == ice.ConnectionStateConnected. Reads the same
  state machine that drives onConnectionStateChange, so it's
  authoritative.
- onNetworkChange wraps the workerICE.Close call in
  `if !workerICE.IsConnected()`. The LTE-bounce path is unchanged
  (pion has already cleared w.agent so IsConnected returns false
  and Close runs as before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
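
A sketch of the IsConnected guard; connectionState stands in for pion's ice.ConnectionState and the worker fields are illustrative:

```go
package peer

import "sync"

// connectionState stands in for ice.ConnectionState.
type connectionState int

const (
	stateNew connectionState = iota
	stateConnected
	stateDisconnected
	stateFailed
)

type workerICE struct {
	mu             sync.Mutex
	agent          any // *ice.Agent in the real code
	lastKnownState connectionState
}

// IsConnected reads the same state that drives onConnectionStateChange,
// so it is authoritative for "the P2P path is still up".
func (w *workerICE) IsConnected() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.agent != nil && w.lastKnownState == stateConnected
}

// onNetworkChange (Phase-3.7g shape): a pure signal/relay blip must not
// tear down a working ICE agent. The LTE-bounce path is unchanged because
// pion has already cleared the agent, so IsConnected returns false.
func onNetworkChange(w *workerICE, closeAgent func()) {
	if w.IsConnected() {
		return // soft fallback: zero state changes, zero ping dropout
	}
	closeAgent()
}
```
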
…e 3.7h)

Real-world LTE-bounce traces showed Phase 3.7f's 30 s grace window
only fit ~2 ICE attempts before the schedule jumped to a 1-minute
exponential suspend. Each pair-check is ~12-15 s, the post-Reset
delay was 5 s, so by the second failure (T+~34s) we were already
outside the grace window. Cold NAT mappings often need 3-4 attempts
to prime, so peers behind a single LTE/Wi-Fi NAT routinely waited
2-3 minutes for P2P recovery instead of the ~30-50 s the underlying
connectivity actually allowed.

Widen the window from 30 s to 60 s and shorten the retry delay from
5 s to 2 s. With these values:

  T=0 s    Reset (srReconnect / network change)
  T=~12 s  Attempt #1 fails (pion pair-check timeout)
  T=~14 s  Attempt #2 starts
  T=~26 s  Attempt #2 fails
  T=~28 s  Attempt #3 starts
  T=~40 s  Attempt #3 fails
  T=~42 s  Attempt #4 starts (still inside 60 s grace)
  T=~54 s  Attempt #4 fails / succeeds

That is roughly twice as many priming attempts before exponential
backoff kicks in. The observed bm-LTE peer that previously needed
2 min 8 s to recover should now converge inside the grace window.

Trade-off: chronically broken peers (behind a genuine symmetric NAT
on both sides) generate ~2x more signal/STUN traffic in the first
60 s after a network change. After the grace expires the schedule
falls back to the same exponential backoff as before, so the long-
term cost is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfigure

Headless setups today need two steps to seed the active profile with a
non-default connection mode:

  netbird service install ...
  netbird up --connection-mode p2p-dynamic --p2p-timeout 60 ...

Add the same four profile-level flags to the install + reconfigure
commands and persist them via profilemanager.UpdateOrCreateConfig so
the daemon picks them up on first start. Reuses the package-level
vars + constants already defined for `up`, so naming and validation
stay consistent.

Flags added (mirrored from upCmd):
  --connection-mode  relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server
  --relay-timeout    seconds (0 = use server default)
  --p2p-timeout      seconds (0 = use server default; only effective
                     in p2p-dynamic mode)
  --p2p-retry-max    seconds (0 = use server default; built-in 15 min
                     fallback when server has not pushed a value)

Only fields whose flag was Changed() are written; unset flags leave
the existing profile untouched. Honors --config when present,
otherwise uses profilemanager.DefaultConfigPath.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
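
A sketch of the flag wiring and the Changed() guard described above, using cobra/pflag; the variable names and the update callback are assumptions, not the actual install/reconfigure code:

```go
package cmd

import "github.com/spf13/cobra"

// Package-level vars mirroring the ones shared with `up` (illustrative names).
var (
	connectionMode string
	relayTimeout   uint32
	p2pTimeout     uint32
	p2pRetryMax    uint32
)

func addConnectionModeFlags(cmd *cobra.Command) {
	cmd.Flags().StringVar(&connectionMode, "connection-mode", "",
		"relay-forced|p2p|p2p-lazy|p2p-dynamic|follow-server")
	cmd.Flags().Uint32Var(&relayTimeout, "relay-timeout", 0, "seconds (0 = server default)")
	cmd.Flags().Uint32Var(&p2pTimeout, "p2p-timeout", 0, "seconds (0 = server default)")
	cmd.Flags().Uint32Var(&p2pRetryMax, "p2p-retry-max", 0, "seconds (0 = server default)")
}

// persistChangedFlags writes only the flags the user actually set, so an
// install/reconfigure run never clobbers values already in the profile.
func persistChangedFlags(cmd *cobra.Command, update func(key string, value any)) {
	if cmd.Flags().Changed("connection-mode") {
		update("ConnectionMode", connectionMode)
	}
	if cmd.Flags().Changed("relay-timeout") {
		update("RelayTimeoutSeconds", relayTimeout)
	}
	if cmd.Flags().Changed("p2p-timeout") {
		update("P2pTimeoutSeconds", p2pTimeout)
	}
	if cmd.Flags().Changed("p2p-retry-max") {
		update("P2pRetryMaxSeconds", p2pRetryMax)
	}
}
```
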
Mirrors the Android Advanced Settings UI in the Fyne-based desktop
client (Linux + Windows). The Network tab in NetBird Settings gains:

  - Connection Mode dropdown: Follow server, relay-forced, p2p,
    p2p-lazy, p2p-dynamic. "Follow server" clears any local override.
  - Relay Timeout (s): only meaningful in p2p-lazy / p2p-dynamic,
    disabled for other modes. Empty = use server default.
  - P2P Timeout (s): only meaningful in p2p-dynamic. Empty = use
    server default.
  - P2P Retry-Max (s): only meaningful in p2p-dynamic. Empty = use
    server default (or built-in 15-min fallback when the management
    server has not pushed a value).

The dropdown's onChange handler enables/disables the timeout entries
to match the inactivity-manager's actual scope (no inactivity teardown
runs in relay-forced / p2p, so those modes get all three fields
disabled).

Daemon plumbing:
  - GetConfigResponse gains connection_mode, p2p_timeout_seconds,
    relay_timeout_seconds, p2p_retry_max_seconds (proto + regenerated
    pb.go). server.go fills them from the active profile config so the
    GUI hydrates correctly on open.
  - SetConfigRequest already had these fields from Phase 1; the GUI's
    buildSetConfigRequest now populates them on every save.

State plumbing in client_ui.go:
  - 4 new fyne widgets (sConnectionMode + 3 entries) created in
    showSettingsUI alongside the existing checks.
  - 4 new state fields cached on serviceClient for change detection.
  - hasConnectionModeChanges() compares dropdown + entries against
    the cached state; saveSettings flow only sends a SetConfigRequest
    when there's a real diff.
  - parseUint32Field tolerates empty / non-numeric input (treated as
    0 = no override) so the user can clear a field without errors.

The lazy-connection checkbox in the systray menu is left in place;
it acts as a quick toggle for the legacy boolean and is kept for
backwards compatibility with profiles that have not yet adopted the
new ConnectionMode field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
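
Two small sketches of the GUI-side helpers described above, kept free of Fyne types: the tolerant entry parser, and the mode-to-enabled mapping the dropdown's onChange handler applies (the treatment of "Follow server" here is an assumption):

```go
package main

import (
	"strconv"
	"strings"
)

// parseUint32Field treats empty or non-numeric input as 0 ("no override"),
// so clearing a field never produces an error.
func parseUint32Field(s string) uint32 {
	s = strings.TrimSpace(s)
	if s == "" {
		return 0
	}
	v, err := strconv.ParseUint(s, 10, 32)
	if err != nil {
		return 0
	}
	return uint32(v)
}

// timeoutFieldsEnabled mirrors the dropdown's onChange logic: only modes
// with an inactivity manager expose the matching timeout entries.
func timeoutFieldsEnabled(mode string) (relay, p2p, retryMax bool) {
	switch mode {
	case "p2p-dynamic":
		return true, true, true
	case "p2p-lazy":
		return true, false, false
	default: // follow-server, relay-forced, p2p: all three disabled
		return false, false, false
	}
}
```
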
Phase 3.7h finalisation. GetConfigResponse gets 4 server_pushed_* fields
(connection_mode + 3 timeouts in seconds). ConnMgr captures the raw
values from each NetworkMap.PeerConfig and exposes them via
ServerPushed*() accessors, all guarded by sync.RWMutex (covered by
TestConnMgr_ServerPushedFieldsAreRaceSafe). Engine.ConnMgr() getter
lets the daemon-RPC layer reach the values; Server.GetConfig fills
them into every response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
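
A sketch of the RWMutex-guarded cache behind the ServerPushed*() accessors; the struct and method names are illustrative, not the actual ConnMgr code:

```go
package peer

import "sync"

type serverPushedValues struct {
	mu                  sync.RWMutex
	connectionMode      string
	relayTimeoutSeconds uint32
	p2pTimeoutSeconds   uint32
	p2pRetryMaxSeconds  uint32
}

// update is called whenever a NetworkMap.PeerConfig arrives.
func (s *serverPushedValues) update(mode string, relay, p2p, retryMax uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.connectionMode = mode
	s.relayTimeoutSeconds = relay
	s.p2pTimeoutSeconds = p2p
	s.p2pRetryMaxSeconds = retryMax
}

// ServerPushedConnectionMode can be called from the daemon-RPC goroutine
// while the engine applies a new NetworkMap on another goroutine.
func (s *serverPushedValues) ServerPushedConnectionMode() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.connectionMode
}
```
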
Phase 3.7h finalisation. The Connection Mode dropdown's "Follow server"
entry now suffixes "(currently: <mode>)" when the engine has received a
PeerConfig; the timeout entries' placeholders show the actual server
default seconds. The redundant "Enable Lazy Connections" tray submenu
is removed since Connection Mode covers it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lues

Phase 3.7h finalisation. Adds Get/Set{ConnectionMode, RelayTimeoutSeconds,
P2pTimeoutSeconds, P2pRetryMaxSeconds} on Preferences and the matching
GetServerPushed* accessors on Client (via connMgrSafe helper). Required
by the Android Connection-Mode picker (in netbird-android repo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MichaelUray MichaelUray force-pushed the pr/b-phase3.5-network-change branch from 025a750 to 3730df0 Compare May 6, 2026 16:16
@sonarqubecloud

sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
7 New issues
7 New Code Smells (required ≤ 0)

See analysis details on SonarQube Cloud
